PREFACE

For some years it has been the privilege of the writer to give 
lectures on the Calculus of Probability to supplement courses of 
lectures by others on elementary statistical methods. The basis 
of all statistical methods is probability theory, but the teacher of 
mathematical statistics is concerned more with the application 
of fundamental probability theorems than with their proof. It 
is thus a convenience for both teachers and writers of textbooks 
on statistical methods to assume the proof of certain theorems, 
or at least to direct the student to a place where their proof may 
be found, in order that there shall be the minimum divergence 
from the main theme. 

This treatise sets out to state and prove in elementary mathe- 
matical language those propositions and theorems of the calculus 
of probability which have been found useful for students of 
elementary statistics. It is not intended as a comprehensive 
treatise for the mathematics graduate; the reader has been 
envisaged as a student with Inter. B.Sc. mathematics who wishes 
to teach himself statistical methods and who is desirous of 
supplementing his reading. With this end in view the mathe- 
matical argument has often been set out very fully and it has 
always been kept as simple as possible. Such theorems as do not 
appear to have a direct application in statistics have not been 
considered and an attempt has been made at each and every 
stage to give practical examples. In a few cases, towards the end 
of the book, when it has been thought that a rigorous proof of 
a theorem would be beyond the scope of the reader's mathematics, 
I have been content to state the theorem and to leave it at 
that. 

The student is to be pardoned if he obtains from the elementary 
algebra textbooks the idea that workers in the probability field 
are concerned entirely with the laying of odds, the tossing of dice 
or halfpennies, or the placing of persons at a dinner table. All 
these are undoubtedly useful in everyday life as occasion arises 
but they are rarely encountered in statistical practice. Hence, 
while I have not scrupled to use these illustrations in my turn, as 
soon as possible I have tried to give examples which might be met
with in any piece of statistical analysis. 

There is nothing new under the sun and although the elemen- 
tary calculus of probability has extended vastly in mathematical 
rigour it has not advanced much in scope since the publication 
of Théorie des Probabilités by Laplace in 1812. The serious
student who wishes to extend his reading beyond the range of 
this present book could do worse than to plod his way patiently 
through this monumental work. By so doing he will find how 
much that is thought of as modern had already been treated in 
a very general way by Laplace. 

It is a pleasure to acknowledge my indebtedness to my 
colleague, Mr N. L. Johnson, who read the manuscript of this 
book and who made many useful suggestions. I must thank 
my colleague, Mrs M. Merrington, for help in proofreading, and
the University Press, Cambridge, for the uniform excellence of
their typesetting. Old students of this department cannot but
be aware that many of the ideas expressed here have been derived 
from my teacher and one-time colleague, Professor J. Neyman, 
now of the University of California. It has been impossible to 
make full acknowledgement and it is to him therefore that I 
would dedicate this book. Nevertheless, just as probability is, 
ultimately, the expression of the result of a complex of many 
factors on one's own mind, so this book represents the synthesis 
of different and often opposing ideas. In brief, while many people 
have given me ideas the interpretation and possible distortion 
of them are peculiarly mine. 

F. N. 

DEPARTMENT OF STATISTICS
UNIVERSITY COLLEGE, LONDON 

CHAPTER I
FUNDAMENTAL IDEAS 

It has become customary in recent years to open expositions 
of probability or statistical theory by setting out those philo- 
sophical notions which appear to the author to underlie the 
foundations of his mathematical argument. Unfortunately 'as
many men, so many opinions' and the unhappy student of the
subject is often left bogged in the mire of philosophical dis- 
quisitions which do not lead to any satisfactory conclusion and 
which are not essential for the actual development of the theory. 
This does not imply, however, that they are not necessary. It is 
true that it is possible to build up a mathematical theory of 
probability which can be sufficient in itself and in which a 
probability need only be represented by a symbol. If the building 
of such a framework were all that was required then speculations 
and theories would be unprofitable, for there can be no reality in 
mathematical theory except in so far as it is related to the real 
world by means of its premises and its conclusions. Since the 
theory of probability attempts to express something about the 
real world it is clear that mathematics alone are not sufficient 
and the student needs must try to understand what is the 
meaning and purpose of the logical processes through which his 
mathematical theory leads him; for the anomalous position 
obtains to-day in which there are many points of view of how to 
define a probability, and as many more interpretations of the 
results of applying probability theory to observational data, but 
the actual calculations concerned are agreed on by all. 

Fundamentally the term probable can only apply to the state 
of mind of the person who uses the word. To make the statement 
that an event is probable is to express the result of the impact of 
a complex of factors on one's own mind, and the word probable 
in this case will mean something different for each particular 
individual; whether the statement is made as a result of numerical 
calculation or as a result of a number of vague general impressions 
is immaterial. The mathematical theory of probability is con- 
cerned, however, with building a bridge, however inadequate it 
Probability Theory for Statistical Methods

may seem, between the sharply defined but artificial country of 
mathematical logic and the nebulous shadowy country of what 
is often termed the real world. And, descriptive theory being at 
present in a comparatively undeveloped state, it does not seem 
possible to measure probability in terms of the strength of the 
expectation of the individual. Hence, while the student should 
never lose sight of the fact that his interpretation of figures will 
undoubtedly be coloured by his own personal impressions and 
prejudices, we shall restrict the meanings of probable and prob- 
ability to the admittedly narrow path of numbers, and we shall 
consider that the work of the probabilist is complete when a 
numerical conclusion has been reached. We shall thus transfer 
a probability from the subjective to the objective field. 

In making such a transference, however, we do not escape the 
main question 'What do we mean by a probability?' although
we possibly make it a little easier to answer. In the field of 
statistical analysis there would seem to be two definitions which 
are most often used, neither of which is logically satisfying. The 
first of these theories we may call the mathematical theory of 
arrangements and the second the frequency theory. Since it 
is by the light of these two theories that probabilities are 
generally interpreted, it will be useful to consider each of these 
in a little detail. The mathematical theory of arrangements 
possibly is as old as gaming and cardplaying; certainly the idea 
of probability defined in such a way was no secret to Abram de 
Moivre (Doctrine of Chances, 1718), and it is in fact the definition 
which everyone would tend automatically to make. For example, 
suppose that the probability of throwing a six with an ordinary 
six-sided die is required. It is a natural proceeding to state that 
there are six sides, that there is a six stamped on only one side 
and that the probability of throwing a six is therefore 1/6. 

The probability of an event happening is, in a general way, 
then, the ratio of the number of ways in which the event may 
happen, divided by the total number of ways in which the event 
may or may not happen. As a further illustration we may con- 
sider the throwing of a coin into the air. When the coin falls there 
are two possible alternatives, the head may be uppermost or the 
tail. If the probability of throwing a head is required then out of 
the two possible alternatives there is only one way in which 
a head may be obtained and the required probability, according
to this theory, is therefore 1/2.
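The rule just stated can be sketched in a few lines of Python. This is only an illustration under the assumption that every arrangement is counted once; the function name `classical_probability` and the variable names are ours and do not come from the text.

```python
from fractions import Fraction


def classical_probability(arrangements, favourable):
    """Ratio of favourable arrangements to all arrangements, each counted once."""
    arrangements = list(arrangements)
    hits = sum(1 for a in arrangements if favourable(a))
    return Fraction(hits, len(arrangements))


die = range(1, 7)          # six faces, each taken as one arrangement
coin = ["head", "tail"]    # two possible alternatives

p_six = classical_probability(die, lambda face: face == 6)
p_head = classical_probability(coin, lambda side: side == "head")
print(p_six, p_head)  # 1/6 1/2
```

Exact fractions are used so that the result reads as a ratio of counts, which is all that this definition asserts.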

So far there is perhaps little need for comment. A mathematical 
set of arrangements Ω is postulated; from the set Ω the subset ω
of arrangements in which the event may happen is picked out,
and the ratio of the subset to the complete set of arrangements is
the defined probability. This probability is exact in the
mathematical sense, just as is the constant π, but it is without meaning.
The statistician now takes over and states that in saying the 
probability of an event is p, where p is the ratio of favourable to 
total arrangements, it will mean that if the experiment is carried 
out on a very large number of occasions, under exactly similar 
conditions, then the ratio of the number of times on which the 
event actually happened to the total number of times the trial 
was made will be approximately equal to p, and this ratio 
will tend more closely to be p as the number of trials is 
increased. 
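The statistician's reading of p may be illustrated by a small simulation. It is a sketch only: a pseudo-random generator stands in for the 'exactly similar conditions' of the text, and the seed, the trial counts and the name `relative_frequency` are arbitrary choices of ours.

```python
import random


def relative_frequency(n_trials, p, rng):
    """Proportion of successes in n_trials simulated trials, each with chance p."""
    successes = sum(1 for _ in range(n_trials) if rng.random() < p)
    return successes / n_trials


rng = random.Random(0)   # fixed seed so that the run can be repeated
p = 1 / 6                # idealized die: chance of throwing a six

# The observed ratio should settle nearer to p as the number of trials grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(relative_frequency(n, p, rng), 4))
```

The printed ratios wander for small n and draw towards 1/6 for large n, which is the behaviour the interpretation describes; the simulation does not, of course, answer the objections to that interpretation.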

It is the definition of probability by mathematical arrange- 
ments and its justification by means of repeated trials which we 
usually accept for lack of anything better, but it is well to realize 
that the interpretation of probability defined in this way is open 
to objection. We have said that probability theory is an attempt 
to bridge the gap which lies between mathematical theory and 
observational reality. It follows therefore that any justification 
of the theory should be based on this so-called reality whereas it 
quite obviously is not. In no series of trials in the so-called real 
world can experiments be made under exactly similar conditions. 
If the trials are made by one person then they must differ in at 
least one aspect, time, and if they are carried out at the same 
instant of time then they must differ in that they are performed 
by entirely different people. It is certain that the conditions as 
stated can never obtain in the real world and that in a strict 
logical sense therefore the bridge between theory and practice 
seems impossible to construct. It must, however, be stated that 
in practice, provided care is taken in the experimental conditions, 
and provided the number of experiments is not large enough for 
the effect of wear in the experimental apparatus to become 
apparent, the ratio of the number of successes to the total number 
of trials does approximate to that stated in the mathematical 
model, and it is considered that this is a justification of the 
mathematical framework. 

The writer feels, however, that the mathematical model should 
be subjected to closer scrutiny. Suppose that we imagine a 
mathematician who has spent all his life in one room and who
has had no opportunity to observe the real world outside. Such a 
mathematician could build a mathematical theory of probability 
just as could another mathematician who had contact with reality, 
and he would be able to state the probability of throwing a six with 
a die or of a head with a halfpenny simply by idealizing the die or 
halfpenny which had been described to him. For any mathematical
argument and conclusion follows logically from the premises on 
which it is based, and both our mathematicians' premises will be 
based on the simplification of the die or the halfpenny. 

This much is certain, but what is not so certain is that these 
premises would be the same for the two persons. The imaginary 
mathematician might postulate a set of arrangements in which 
a weight of one was given to one, a weight of two to two, and so 
on making a total set of 21. From this he could go on to postulate
that the probability of a six was 6/21, and within the narrow 
framework of his own postulates he would be correct. Or he might 
postulate three cases for the halfpenny, one for heads, one for 
tails, and one for the occasion on which the halfpenny stands 
upright. Again, whatever conclusions he drew regarding the 
probability of heads would be correct with reference to the set of 
arrangements which he had postulated. 
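The dependence of the answer on the postulated set can be made concrete. The sketch below assumes that a set of arrangements is represented as a mapping from outcome to weight; the names `weighted_probability`, `secluded_die` and so on are ours, chosen only for illustration.

```python
from fractions import Fraction


def weighted_probability(weights, favourable):
    """Favourable weight divided by total weight of the postulated arrangements."""
    total = sum(weights.values())
    hits = sum(w for outcome, w in weights.items() if favourable(outcome))
    return Fraction(hits, total)


# The secluded mathematician's die: weight k given to face k, total 21.
secluded_die = {face: face for face in range(1, 7)}
# The usual postulate: every face given the same weight.
usual_die = {face: 1 for face in range(1, 7)}
# Three cases for the halfpenny: heads, tails, standing upright.
secluded_coin = {"head": 1, "tail": 1, "upright": 1}

print(weighted_probability(secluded_die, lambda f: f == 6))        # 2/7, i.e. 6/21
print(weighted_probability(usual_die, lambda f: f == 6))           # 1/6
print(weighted_probability(secluded_coin, lambda s: s == "head"))  # 1/3
```

Each result is correct relative to its own set of arrangements; nothing in the arithmetic says which set should be chosen.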

The mathematician of the real world would certainly act 
differently in that he would have no hesitation in stating that the 
total number of arrangements for the die would be 6 and for the 
halfpenny 2; but why would he do so? It could only be because 
previous experience had taught him, either from a study of 
applied mathematics or from gaming itself, that one side of the 
die was as likely to turn up as any other. In other words, un- 
consciously perhaps, he would choose for his fundamental set of 
arrangements that set in which the alternatives were equally 
probable, and in so doing he would be guilty of circular reasoning. 

That the mathematical theory of arrangements will always lead 
to such an impasse appears more or less inevitable. Suppose for 
example that the real mathematician is confronted with a 
problem in which he had no means of being able to foretell the
answer, a state of affairs which is more frequent than not in 
statistical practice. He must therefore build his fundamental set 
of arrangements as best he may, but since the acid test of theory 
is experiment he would undoubtedly carry out experiments to 
test the adequacy of his mathematical set-up. Now if experiment 
showed that one of his alternatives occurs with twice or thrice the 
frequency which he had allowed, he would then alter his funda- 
mental set so that his theoretical probability and the probability 
as estimated from practice were more or less in agreement. And, 
in so doing, he would again be arguing in a circular way and his 
theoretical definition would have little validity when applied to 
a practical problem. 

The writer would suggest therefore that although the mathe- 
matical theory of arrangements is exact on the theoretical side, 
it is inadequate when the link between theory and practice is 
attempted and that the stumbling block of circular reasoning 
which lay in the path of Laplace and subsequent nineteenth- 
century writers has not really been eliminated. Some probabilists 
have shown themselves to be aware of this and have attempted 
a definition of probability not very different from that which we 
have already given as the connecting bridge between the mathe- 
matical theory of arrangements and observation. The frequency 
definition is commonly given as follows: 'If in a series of
independent and absolutely identical trials of number n the event E
is found to occur on m occasions, the probability of E happening 
is defined as the limit of the ratio m/n as n becomes very large.' 
We have already noted the objections which might be raised 
against this definition in that the conditions 'absolutely identical'
are impossible to satisfy and that as n increases the ratio does not 
tend to a definite limit for the effect of wear on the apparatus is 
not inconsiderable. In practice this frequency definition does 
seem to work over a limited range, but it is difficult to fit into 
a mathematical scheme, and is therefore skirted rather warily by 
mathematicians and statisticians alike. 

A definition along the lines put forward for other statistical 
parameters by J. Neyman and E. S. Pearson might seem to hold 
a promise of more validity, although the pitfalls to be met in 
pursuing such a course are many. We may begin with the idea 
that there is a population of events, and that there exists a
population parameter, constant and fixed, which describes these 
events, and which we may call a population probability. Such 
a probability would bear no resemblance to the probabilities 
which we have just discussed. For example, if we wished to know 
the population probability of a Chelsea Pensioner who attains the 
age of 70 in 1945, dying before he reaches the age of 71, we could 
by waiting until the end of 1946 find out how many Chelsea 
Pensioners attaining age 70 in 1945 had died before reaching 
age 71, and define our population probability as just that 
proportion of Pensioners who had died out of the total number 
exposed to risk. 

Hence a population probability is just the ratio of the number 
of times which the event has happened divided by the total 
number of events if the population is finite. There are, however, 
many populations in statistics which are not capable of being 
enumerated in this way. For instance the population of the 
tossing of a die or of the throwing of halfpennies will never be 
completed. Nevertheless we shall postulate that these popula- 
tions can be described by a constant parameter, or a population 
probability, which experience has shown to be equal to a certain 
value, and which, if an infinite population were capable of being 
enumerated, would be equal to the proportion of successes in 
the total population. 

Following along the lines of statistical practice we have there- 
fore an unknown population parameter, our population prob- 
ability p, which we postulate exists, and which we desire to 
estimate. We perform a series of experiments and from these 
experiments we derive an estimate of p. For example, we might 
throw a die n times and count the number of times, x, that a six 
fell uppermost. It is clear that the mean value of x/n in repeated
sampling will be approximately equal to p, and if it were possible 
to carry out an infinite series of trials, in each of which the die 
was thrown n times, the mean value would be exactly equal to p. 
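The behaviour of x/n in repeated sampling may be sketched by simulation. The assumptions are ours: an idealized die whose population probability of a six is 1/6, a series length of n = 60, and 2,000 repeated series; none of these figures comes from the text.

```python
import random
from statistics import mean


def sixes_proportion(n, rng):
    """Throw an idealized die n times and return x/n, the proportion of sixes."""
    x = sum(1 for _ in range(n) if rng.randint(1, 6) == 6)
    return x / n


rng = random.Random(42)   # fixed seed so that the run can be repeated
n = 60                    # throws in each series
estimates = [sixes_proportion(n, rng) for _ in range(2_000)]

print(min(estimates), max(estimates))   # single series differ from one another
print(mean(estimates))                  # their mean lies close to 1/6
```

Any one series gives only an estimate of p, while the mean over many series settles near the postulated population value.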

Let us turn for illustration to the case of our (hypothetical) 
Chelsea Pensioners. It is desired to know the probability of a 
Pensioner who attains age 70 in the year 1945 dying before he is 
71 years of age. As has been pointed out, it would be possible to 
wait until the end of 1946 and calculate the exact proportion of 
Pensioners who have died before reaching 71. This is equivalent
to stating that the desired population probability will exist at 
the end of the calendar year but that at the present time it does 
not exist because it is not known which of the individuals 
possesses the required characteristics of dying or not dying. In 
order therefore to estimate the probability of a Pensioner 
attaining age 70 in the year 1945 dying before he is 71 years of 
age it will be necessary to postulate a hypothetical population of 
Pensioners attaining age 70 in 1945 of whom a fixed proportion 
may be expected to die before reaching 71. If we now choose a 
number of other years in which conditions of living were reason- 
ably the same, and calculate the proportion of Pensioners who 
satisfied the required conditions, we may regard the proportions 
thus obtained as estimates of the unknown population parameter 
and from combining them we may make an estimate of p which
can be stated to lie within certain limits. 
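The combining of yearly proportions can be sketched as follows. The counts are invented purely for illustration, and the rough limits use the binomial standard error, which is an assumption of this sketch and not a method the text has yet introduced.

```python
from math import sqrt

# (deaths before age 71, Pensioners attaining age 70) for four comparable years.
# These figures are hypothetical and serve only to show the arithmetic.
yearly = [(12, 150), (9, 140), (15, 160), (11, 145)]

per_year = [d / n for d, n in yearly]     # one estimate of p from each year
deaths = sum(d for d, _ in yearly)
exposed = sum(n for _, n in yearly)
p_hat = deaths / exposed                  # pooled estimate of p

# Rough limits: two binomial standard errors either side (an assumed convention).
se = sqrt(p_hat * (1 - p_hat) / exposed)
lower, upper = p_hat - 2 * se, p_hat + 2 * se

print([round(p, 3) for p in per_year])
print(round(p_hat, 3), round(lower, 3), round(upper, 3))
```

The pooling is only meaningful if the years chosen really do share the same conditions of living, as the argument above requires.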

Now, if we take this estimated value of p as the probability 
that a Pensioner attaining age 70 in the year 1945 will die before 
he reaches the age of 71, we shall not expect the value of p actually
calculated from the 1945-6 figures to have necessarily the same 
value. For the 1945-6 figure will be the exact value for that 
particular year, but will itself also be an estimate of the chances of 
death in any year for a Pensioner of the stated age. Hence if we 
alter our question a little and ask what are the chances of death 
before reaching the age of 71 for a Pensioner who attains 
age 70 in the year Y, then the addition of the 1945-6 data should 
give increased accuracy to our estimate of the unknown prob- 
ability and should enable closer limits to be obtained for this 
probability provided all the known causes which might cause 
fluctuations are controlled for each of the years considered. 

This control of causes of variation is important and it may be 
well to digress a little from the main theme in order to consider 
what is meant by it. In any set of figures, whether obtained 
directly by experiment or collected from records, there will be 
variations both between the individual sets and from some values 
which might be expected by hypothesis. It is a commonplace 
to state that before the collection of material is begun, no matter 
what the material may be, all known causes of variation should 
be eliminated or at least controlled. Such known causes are often 
spoken of as assignable causes because we are able to state
definitely that they would cause variation unless they were 
controlled. Usually in statistical method the aim of the compiler 
of figures is to eliminate such sources of variation, but there are 
occasions on which assignable variation is too small to influence 
results, or it would cost too much in time and money to eliminate 
it. In such cases the assignable variation remains in the material 
but it is necessary to remember it when interpreting any 
numerical results. 

Thus in discussing the chances of death of a Chelsea Pensioner 
it is clear that we should use for our estimate of probability only 
those figures relating to other Pensioners of the same age and that 
for each year studied we should take care that as far as was possible 
all other conditions were the same. By so doing we should ensure 
that each set of figures gave an estimate of the same unknown 
hypothetical population probability. 

After all the assignable causes have been controlled the result 
of any one experiment or collection of figures is still subject to 
variation from causes which we do not know about and therefore 
cannot control. It is these unassignable causes, or as they are 
more often called, random errors, which create the need for the 
concept of probability. A penny is tossed in the air. If care is 
taken in the spin of the coin imparted by the finger and thumb, 
and if a soft cushioned surface is chosen for the penny to fall 
upon, then it is often possible to determine beforehand whether 
it will fall with the head or the tail uppermost. That is to say by 
controlling certain sources of variation the fall of the penny can 
be predicted. 

On the other hand, if no care is exercised in the tossing and in 
arranging the fall of the penny it is not possible to predict which 
way up it will fall even if the experiment is carried out a number 
of times. Random errors due to unassigned causes determine 
whether the penny shall fall head or tail uppermost and, as Borel 
has written, 'the laws of chance know neither conscience nor
memory'. All that we can know is that if we carry out a series 
of trials and estimate a probability from each, these estimates 
will vary about an unknown population probability, that these 
estimates can be used to estimate this unknown population parameter
and the limits within which it may lie, and that previous
experience has shown that if the errors are really random then
the mean value of estimates from a number of series of trials will 
approximate to a constant number. 

What we have postulated therefore for our definition of 
probability is that the population probability is an unknown 
proportion which it is desired to estimate. This parameter may 
only be estimated by a series of experiments in each of which 
the same assignable causes are controlled and in each of which, 
as far as is possible, no new cause of variation is allowed to enter. 
The result of each experiment may be regarded as an estimate of 
this unknown proportion and the pooling of these estimates 
enables prediction to be made for any new set of experiments it is 
desired to carry out. We have not postulated absolutely identical 
conditions for each of a set of experiments. Experience has shown 
this to be unnecessary and provided the more obvious sources 
of variation are controlled the different probability estimates 
will vary about the unknown true value. We may discuss the 
nature and size of this variation at a later stage. 

The advantages to be gained by defining a probability in this 
way would seem to the writer to be many. For example, such 
a definition will fit more closely into statistical practice than does 
(say) the mathematical theory of arrangements. It is rare in 
statistical practice to be able to state the alternatives of equal 
weight, such as one is able to do with the six faces of a die; in 
fact generally it is necessary to find a probability by evaluating 
the ratio of the number of successes to the total number of trials. 
Under the scheme which we have just set out this would be 
recognized for what it is; an estimate of the unknown probability 
and an estimate which will undoubtedly be different from that 
which will be obtained when more evidence renders another 
calculation possible. Further, if it is shown by experiment that 
several alternatives are approximately equally possible as in the 
case of the six faces of the die or the two sides of a penny, then 
there appears to be no reason why a mathematical model based 
on equi-probable alternatives should not be constructed if it is 
so desired. But the mathematical model can only be established 
after experiment has shown its possible construction, although 
such a construction will be valuable in certain mathematical 
applications. 

The interpretation of a probability will follow directly from
the definition which we have given. If as the result of calculations 
we have found that the probability of a given event is p, then 
we should say that if a series of experiments were carried out in 
which all assignable causes of variation were controlled and no 
further large assignable or unassignable causes of variation were 
permitted to intervene, then the mean value in repeated experi- 
ments of the proportion of times in which the event occurred, 
will be approximately equal to p. As an illustration, suppose that 
the results of experiments tell us that the probability of a tank 
crossing a minefield without detonating a mine is 0.86. This
would mean that the average number of tanks crossing unscathed 
out of every 100 attempting to cross the minefield would be 86. 
We should not be able to say which tank would be blown up, nor 
what the exact proportion of tanks crossing unscathed in any 
given 100 would be, but we should feel sure that the average 
proportion of successes, if a series of trials could be made, would 
approximate to 0.86.
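This reading of 0.86 may be illustrated by simulating batches of 100 tanks. The sketch assumes that each crossing is an independent trial with the same chance of success, which is a simplification of ours; the seed and the number of batches are arbitrary.

```python
import random
from statistics import mean


def unscathed(batch_size, p, rng):
    """Number of tanks, out of batch_size, that cross without detonating a mine."""
    return sum(1 for _ in range(batch_size) if rng.random() < p)


rng = random.Random(7)    # fixed seed so that the run can be repeated
p_cross = 0.86            # probability stated in the text
batch = 100

counts = [unscathed(batch, p_cross, rng) for _ in range(1_000)]

print(min(counts), max(counts))   # individual batches of 100 differ
print(mean(counts))               # the average over batches lies near 86
```

No single batch need show exactly 86 successes; it is the average over a series of batches that approximates to 86.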

We began by stating that by producing a numerical probability 
the work of the probabilist should be considered as finished and 
that of the interpreter begun and perhaps the illustration last 
used of tanks crossing a minefield may be helpful in emphasizing 
what is meant by this. To tell a worker in pure probability theory 
that the chance of a tank crossing a minefield unscathed is 0.86
would be to convey very little information to him unless he also
happened to be a tank specialist. He may perhaps reply 'Good'
or 'Not so good' according to whether he regarded 0.86 as a high
or not so high number, but almost certainly his reply would be 
as the result of the impact of the pure number on his mind and 
little else. 

On the other hand, to tell a general in charge of armoured troops 
that the probability was 0.86 would provoke an instant response.
If he had plenty of tanks and if advantage was to be gained by
a swift crossing then he might regard 0.14 as an acceptable risk
and order his armour to attempt the crossing. On the other hand
if tanks were few and pursuit not profitable then he might
regard 0.14 as not acceptable. In this case, as in every other
interpretation of probability, the numerical result is only one 
of many factors which have to be taken into account in reaching 
a decision. Generally these other factors are not capable of
numerical expression or they would have been included in the 
probability calculation. For the experimentalist turned statis- 
tician it is often possible for the role of probabilist and interpreter 
to be combined, for only the person who has collected the material 
can know its exact worth in interpretation; but the professional 
statistician, per se, may only calculate a numerical probability 
and must perforce leave the interpretation of probability in the 
shape of decision for action in the hands of someone else. 

CHAPTER II

PRELIMINARY DEFINITIONS AND 
THEOREMS 

In the previous chapter we have been concerned with the
definition and interpretation of what is meant by a probability. 
In this and succeeding chapters the objective will be the setting 
out of the definitions and rules whereby we may build up a theory 
for the addition and multiplication of probabilities. The actual 
theory will be mathematical and there will be no need to interpret 
it in the terms of the world of observation until the logical 
processes of the mathematics are complete and a numerical 
answer is reached. It is useful to note that the theory which we 
shall set out will be applicable to all numerical theories of 
probability; in fact the only difference between any of the 
numerical theories of probability will lie in the definition of what 
is meant by a probability and the interpretation of statistical 
calculations by the help of such a probability. 

We begin by defining the Fundamental Probability Set. The 
fundamental probability set, written F.P.S. for short, will be just 
that set of individuals or units from which the probability is 
calculated. Thus if it is necessary to estimate a probability from 
a series of n observations, then these n observations will be the 
F.P.S. Or if sufficient experimental evidence is available to justify 
the setting up of a mathematical model in the shape of the 
mathematical theory of arrangements, then the F.P.S. would be 
the total number of arrangements specified by the theory. In the 
case of a die the F.P.S. given by the mathematical theory of 
arrangements would have six members, but if experiment had 
shown that the die was biased in some way, and it was necessary 
to estimate a probability from the observations, then the F.P.S. 
would contain the total number of throws of the die which were 
recorded. It is unnecessary to labour the point unduly, but it 
should be noted that the elasticity of the definition does, as we 
have stated above, render the subsequent theory independent of 
whatever definition of probability is used. 

In order to keep the theory in a generalized form we shall
speak of elements of the F.P.S. possessing or not possessing a 
certain property when it is desired to calculate a probability. 
For example, in calculating the probability of throwing a two 
with one throw of a die we may speak of an element of the F.P.S. 
possessing the property of being a two, or in calculating the 
probability of an event we may speak of an element of the F.P.S. 
possessing the property of happening or not happening. A definite 
notation will be adopted for this and we shall write $P\{E \in \omega \mid \Omega\}$ 
to stand for the words 'the probability that elements of the 
subset $\omega$ possess the property $E$ referred to a fundamental 
probability set $\Omega$'. This will invariably be abbreviated to $P\{E\}$. 
As stated previously this probability, or strictly estimate of 
probability, will be the ratio of the number of elements of the 
F.P.S. possessing the property $E$ (i.e. the subset $\omega$) to the total 
number of elements of which the F.P.S. is composed (i.e. the 
F.P.S. $\Omega$). 

It is one of the weaknesses of probability theory that it has, 
possibly through its intimate connexion with everyday life, 
taken certain everyday words and used them in a specialized 
sense. This is confusing in that it is not always possible to avoid 
using the words in both their specialized and everyday meanings. 
As far as is possible we shall attempt to confine the words to their 
specialized meaning only. 

DEFINITION. Two properties, $E_1$ and $E_2$, are said to be 
'mutually exclusive' or 'incompatible' if no element of the 
F.P.S. of $E_1$ and $E_2$ may possess both the properties $E_1$ and $E_2$. 

The definition is immediately extended to k properties. 

The definition of mutually exclusive or incompatible properties 
is thus seen to follow along common-sense lines. For example, if 
it was desired to calculate the probability that out of a given 
number of persons chosen at random ten (say) would have blue 
eyes, and nine (say) would have brown eyes, then in stating that 
the property of possessing a pair of blue eyes was incompatible 
with the property of possessing a pair of brown eyes, we should 
merely be expressing the obvious. 

DEFINITION. $E_1, E_2, \ldots, E_k$ are said to be the 'only possible' 
properties if each element of the F.P.S. must possess one of these 
properties, or at least one. 

THEOREM. If $E_1$ and $E_2$ are mutually exclusive and at the same 
time the only possible properties, then 

    $$P\{E_1\} + P\{E_2\} = 1.$$

Let the F.P.S. be composed of $n$ elements of which $n_1$ possess the 
property $E_1$ and $n_2$ possess the property $E_2$. Since $E_1$ and $E_2$ are 
mutually exclusive no element of the F.P.S. may possess both the 
properties $E_1$ and $E_2$. We have therefore by definition, 

    $$P\{E_1\} = n_1/n, \quad P\{E_2\} = n_2/n.$$

Further, since $E_1$ and $E_2$ are the only possible properties, each 
element of the F.P.S. must then possess either $E_1$ or $E_2$, from 
which it follows that $n_1 + n_2 = n$ and that 

    $$P\{E_1\} + P\{E_2\} = 1.$$



Extension of theorem to k properties. An extension of the above 
theorem for k mutually exclusive and only possible properties 
may easily be made by following along the same lines of argument. 
If there are $k$ mutually exclusive and only possible properties, 
$E_1, E_2, \ldots, E_k$, then 

    $$\sum_{i=1}^{k} P\{E_i\} = 1.$$



Definition of logical sum. Assume that the F.P.S. is composed 
of elements some of which possess the property $E_1$, or the property 
$E_2$, ... or the property $E_k$. The logical sum, $E_0$, of any number of 
these different properties will be a property which consists of an 
element of the F.P.S. possessing any one of these properties, or 
at least one. This may be written 

    $$E_0 = E_1 + E_2 + \cdots + E_k.$$



Definition of logical product. Assume that the F.P.S. is composed 
of elements some of which possess the property $E_1$, or the 
property $E_2$, ... or the property $E_k$. The logical product, $E'$, of 
any number, $m$, of these different properties will be a property 
which consists of an element of the F.P.S. possessing all $m$ 
properties. Thus 

    $$E' = E_1 E_2 \cdots E_m.$$

These definitions may be illustrated by means of the following 
theorem. 

THEOREM. The logical sum of two properties $E_1$ and $E_2$ is given 
by 

    $$P\{E_1 + E_2\} = P\{E_1\} + P\{E_2\} - P\{E_1 E_2\}.$$

Let the F.P.S. consist of $n$ elements, $n_1$ of which possess the 
property $E_1$, $n_2$ possess the property $E_2$, $n_{12}$ possess both the 
properties $E_1$ and $E_2$, and $n_0$ of which possess neither $E_1$ nor $E_2$. 
The proof of the theorem then follows directly from definition. 

    $$P\{E_1 + E_2\} = \frac{n_1 + n_2 - n_{12}}{n} = \frac{n_1}{n} + \frac{n_2}{n} - \frac{n_{12}}{n},$$

and the result follows. 

COROLLARY. If $E_1$ and $E_2$ are mutually exclusive then 

    $$P\{E_1 + E_2\} = P\{E_1\} + P\{E_2\}.$$

For if $E_1$ and $E_2$ are mutually exclusive then no element of the 
F.P.S. may possess both the properties $E_1$ and $E_2$. This means 
that $n_{12} = 0$ and therefore that $P\{E_1 E_2\} = 0$. 
Similarly for $k$ mutually exclusive properties 

    $$P\Big\{\sum_{i=1}^{k} E_i\Big\} = \sum_{i=1}^{k} P\{E_i\}.$$
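
The addition theorem may be checked by direct counting on a small F.P.S. The following is a minimal sketch in Python, assuming a fair six-sided die as the F.P.S.; the two properties chosen (an even number, a number of at least 4) and all function names are illustrative only and do not come from the text.

```python
from fractions import Fraction

# F.P.S. for one throw of a die: six elements of equal weight.
fps = [1, 2, 3, 4, 5, 6]

def prob(prop):
    """P{E}: the proportion of elements of the F.P.S. possessing the property."""
    return Fraction(sum(1 for x in fps if prop(x)), len(fps))

def is_even(x):        # property E_1 (illustrative choice)
    return x % 2 == 0

def at_least_4(x):     # property E_2 (illustrative choice)
    return x >= 4

p_sum = prob(lambda x: is_even(x) or at_least_4(x))         # P{E_1 + E_2}
p_product = prob(lambda x: is_even(x) and at_least_4(x))    # P{E_1 E_2}

# Addition theorem: P{E_1 + E_2} = P{E_1} + P{E_2} - P{E_1 E_2}
assert p_sum == prob(is_even) + prob(at_least_4) - p_product
print(p_sum, p_product)
```

Because these two properties are not mutually exclusive, the term for the logical product does not vanish; replacing them by, say, 'being a 2' and 'being a 3' reduces the check to the corollary.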



Exercise. Find an expression for the logical sum of three 
properties $E_1$, $E_2$ and $E_3$, i.e. find $P\{E_1 + E_2 + E_3\}$ and show how 
the expression is simplified if the properties are assumed to be 
mutually exclusive. 



Numerical Examples * 

( 1 ) Given that the probability of throwing a head with a single 
toss of a coin is constant and equal to 1/2, if two identical coins 
are thrown simultaneously once what is the probability of 
obtaining (a) two heads, (b) one head and one tail, (c) two tails? 

If it is given that the probability of throwing a head with a 
single toss of a coin is constant and equal to 1/2, it follows that the 
probability of throwing a tail is also constant and equal to 1/2 since 
we may assume that the properties of being head or tail are the 
only possible and they are obviously mutually exclusive. We 
may therefore construct a F.P.S. containing two elements of equal 
weight; the first of these would possess the property of being 
a head (H) and the second would possess the property of being 
a tail (T). Hence for two coins we should have the following 
possible alternatives 

    HH    HT    TH    TT

i.e. a F.P.S. of four elements and the probability of two heads, 
one head and one tail, two tails will be 1/4, 1/2, 1/4 respectively. 
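
The enumeration for two coins is easily mechanized. The sketch below, in Python, builds the F.P.S. of ordered pairs and counts the heads in each element; it assumes, as the example does, that all four elements carry equal weight, and the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# F.P.S. for two coins: every ordered pair of H and T, each of equal weight.
fps = list(product("HT", repeat=2))

# Number of elements of the F.P.S. showing 0, 1 or 2 heads.
heads = Counter(pair.count("H") for pair in fps)

# Probability = elements possessing the property / total elements.
probs = {k: Fraction(count, len(fps)) for k, count in heads.items()}
print(probs)
```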

(2) The probability that any given side will fall uppermost 
when an ordinary six-sided die is tossed is constant and equal to 
1/6. What is the probability that when a die is tossed a 2 or 3 will 
fall uppermost? There are various ways in which this problem 
may be attempted. The setting up of a mathematical model 
along the lines of the previous problem would give a probability 
of 1/3. As an alternative we may note that the property of being 
a 2 is mutually exclusive of the property of being a 3 and that the 
logical sum of the probabilities of a 2 or a 3 will therefore be 

    $$P\{2 + 3\} = P\{2\} + P\{3\} = 1/6 + 1/6 = 1/3.$$



(3) Two dice are tossed simultaneously. If the probability 
that any one side of either die will fall uppermost is constant and 
equal to 1/6, what is the probability that there will be a 2 upper- 
most on one and a 3 on the other? The probability that there 
will be a 2 on the first die and a 3 on the second will be the 
logical product of the two probabilities, i.e. 

    $$P\{\text{2 on 1st and 3 on 2nd}\} = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36}.$$

The problem does not, however, specify any order to the dice and 
it would be possible to get a 3 on the first and a 2 on the second. 
The required probability is therefore 1/18. 

(4) Conditions as for (3). What is the probability that the 
two numbers will add up to 5? Answer: 1/9. 
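
Examples (3) and (4) may both be verified by counting over the 36 ordered pairs which form the F.P.S. for two dice. The following Python sketch assumes each pair to be of equal weight, as the examples state; the helper name `prob` is illustrative.

```python
from fractions import Fraction
from itertools import product

# F.P.S. for two dice: 36 ordered pairs, each of equal weight.
fps = list(product(range(1, 7), repeat=2))

def prob(prop):
    """Proportion of elements of the F.P.S. possessing the property."""
    return Fraction(sum(1 for pair in fps if prop(pair)), len(fps))

# Example (3): a 2 on one die and a 3 on the other, in either order.
p_two_and_three = prob(lambda d: sorted(d) == [2, 3])

# Example (4): the two numbers add up to 5.
p_sum_five = prob(lambda d: d[0] + d[1] == 5)

print(p_two_and_three, p_sum_five)
```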

(5) n halfpennies are tossed simultaneously. If for each coin 
the probability that it will fall with head uppermost is constant 
and equal to 1/2, what is the probability that $k$ out of the $n$ coins 
will fall with head uppermost? 

Suppose that the coins are numbered and that the first $k$ fall 
with head uppermost and the second $n - k$ with tail uppermost. 
The probability of this is 

    $$\left(\tfrac{1}{2}\right)^{k} \left(\tfrac{1}{2}\right)^{n-k} = \left(\tfrac{1}{2}\right)^{n}.$$

But no order was specified regarding the $k$ heads and they may 
be spread in any manner between the $n$ coins. The number of 
ways in which $k$ heads and $n - k$ tails can be arranged is 

    $$\frac{n!}{k!\,(n-k)!}$$

and the required probability is therefore 

    $$\frac{n!}{k!\,(n-k)!} \left(\tfrac{1}{2}\right)^{n}.$$



(6) Three halfpennies are tossed one after the other. If the 
constant probability of getting a head is 1/2 for each coin, what is 
the joint probability that the first will be a head, the second 
a tail, and the third a head? 

(7) Four dice are tossed simultaneously. If the constant 
probability of getting any number uppermost on any one die is 
1/6, what is the probability that the sum of the numbers on the 
four dice is 12? 

(8) Five cards are drawn from a pack of 52. If the constant 
probability of drawing any one card is 1/52, what is the probability 
that these five cards will contain (a) just one ace, (b) at least one 
ace? If the probability is constant then a F.P.S. can be set up 
consisting of 52 equally likely alternatives. The number of ways 
in which 5 cards can be drawn from 52 if all cards are equally 
likely is 

    $$\frac{52!}{5!\,47!}.$$

From this number it is necessary to pick out the number of sets 
of 5 cards, one card of which is an ace. This is perhaps most easily 
done by first withdrawing the 4 aces from the pack. The 
number of ways in which 4 cards may be drawn from 48 will be 

    $$\frac{48!}{4!\,44!}.$$

To each of these sets of 4 cards one ace must be added and this 
may be done in 4 ways. Hence the total number of ways in 
which 5 cards may be drawn, one of which is an ace, is 

    $$4 \cdot \frac{48!}{4!\,44!},$$

and the required probability of drawing 5 cards, one of which is 
an ace, is 

    $$4 \cdot \frac{48!}{4!\,44!} \Big/ \frac{52!}{5!\,47!}.$$

A similar argument may be used for the probability of obtaining 
at least one ace. In this problem the 5 cards may contain just 
one ace, or two aces, or three or four. The required probability 
will therefore be 

    $$\frac{5!\,47!}{52!} \left[ 4 \cdot \frac{48!}{4!\,44!} + \frac{4 \cdot 3}{2!} \cdot \frac{48!}{3!\,45!} + \frac{4 \cdot 3 \cdot 2}{3!} \cdot \frac{48!}{2!\,46!} + \frac{48!}{1!\,47!} \right].$$
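
The counting in Example (8) may be carried through numerically. The sketch below, in Python, uses the standard library function `math.comb` for the number of combinations; the function name `p_aces` is illustrative. As a cross-check, which is an addition not made in the text, the probability of at least one ace is compared with one minus the probability of no ace.

```python
from fractions import Fraction
from math import comb

# Number of sets of 5 cards from 52: 52!/(5! 47!).
hands = comb(52, 5)

def p_aces(k):
    """Probability that the 5 cards contain exactly k aces (k = 0, ..., 4)."""
    # Choose k of the 4 aces and the other 5 - k cards from the 48 non-aces.
    return Fraction(comb(4, k) * comb(48, 5 - k), hands)

p_one = p_aces(1)                                     # (a) just one ace
p_at_least_one = sum(p_aces(k) for k in range(1, 5))  # (b) one, two, three or four

# Cross-check through the complementary property 'no ace'.
assert p_at_least_one == 1 - p_aces(0)
print(float(p_one), float(p_at_least_one))
```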



Examples based on games of chance tend to be somewhat 
artificial in that the answer when obtained is without interest, for 
few persons are prepared nowadays to wager large sums of money 
on the fall of a die. Nevertheless it was from a study of the gaming 
tables that the earliest developments of the theory were made 
and such problems are of value if the student learns the elements 
of combinatory theory from their study. 

Definition of relative probability. The relative probability of 
a property $E_2$, given a property $E_1$, will be defined as the 
probability of a property $E_2$ referred to the set of individuals of the 
F.P.S. possessing the property $E_1$. This will be written as 

    $$P\{E_2 \mid E_1\}.$$

The notation may be translated into words exactly as before 
with the addition of the word 'given' represented by the upright 
stroke |. 
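
THEOREM. Whatever the two properties $E_1$ and $E_2$, 

    $$P\{E_1 E_2\} = P\{E_1\}\, P\{E_2 \mid E_1\} = P\{E_2\}\, P\{E_1 \mid E_2\}.$$

Let the F.P.S. be composed of $n$ elements of which $n_1$ possess the 
property $E_1$, $n_2$ possess the property $E_2$, $n_{12}$ possess both the 
properties $E_1$ and $E_2$, $n_0$ possess neither $E_1$ nor $E_2$. By definition, 

    $$P\{E_1 E_2\} = \frac{n_{12}}{n} = \frac{n_1}{n} \cdot \frac{n_{12}}{n_1} = P\{E_1\}\, P\{E_2 \mid E_1\},$$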






and the proof of the theorem follows. The second equality 
follows in a similar way. 
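
The theorem may be illustrated by counting. In the Python sketch below the F.P.S. is taken to be the 36 ordered pairs from two dice, and the two properties (a total of 9 or more, a 6 on the first die) are chosen purely for illustration; a relative probability is obtained by restricting the reference set to the elements possessing the given property.

```python
from fractions import Fraction
from itertools import product

fps = list(product(range(1, 7), repeat=2))   # 36 elements of equal weight

def prob(prop, given=lambda e: True):
    """P{prop | given}: proportion within the elements possessing `given`."""
    reference = [e for e in fps if given(e)]
    return Fraction(sum(1 for e in reference if prop(e)), len(reference))

def e1(d):            # property E_1 (illustrative): total of 9 or more
    return d[0] + d[1] >= 9

def e2(d):            # property E_2 (illustrative): a 6 on the first die
    return d[0] == 6

def both(d):          # logical product E_1 E_2
    return e1(d) and e2(d)

lhs = prob(both)
# P{E_1 E_2} = P{E_1} P{E_2 | E_1} = P{E_2} P{E_1 | E_2}
assert lhs == prob(e1) * prob(e2, given=e1)
assert lhs == prob(e2) * prob(e1, given=e2)
print(lhs)
```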

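THEOREM. Whatever the properties $E_1, E_2, \ldots, E_k$, 

    $$P\{E_1 E_2 \cdots E_k\} = P\{E_1\}\, P\{E_2 \mid E_1\}\, P\{E_3 \mid E_1 E_2\} \cdots P\{E_k \mid E_1 E_2 \cdots E_{k-1}\}.$$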

The proof follows immediately from repeated application of the 
preceding theorem. 

Numerical Example 

Three bags A, B and C are filled with balls identical in size and 
weight. Bag A contains $M$ balls of which $m_1$ are stamped with 
the number 1, $m_2$ with the number 2, and $m_1 + m_2 = M$. Bag B 
contains $N_1$ balls, $n_1$ of which are white and $N_1 - n_1$ of which are 
black, while Bag C contains $N_2$ balls, $n_2$ of which are white and 
$N_2 - n_2$ black. A ball is drawn from A. If it bears the number 1 
a ball is drawn from B; if it bears the number 2 a ball is drawn 
from C. Assuming that the probability of drawing an individual 
ball from any bag is constant and equal to the reciprocal of the 
number of balls in the bag, what is the probability that, if a ball 
is drawn from A and then from B or C, the second ball is white? 
Describe in detail the F.P.S. to which this probability may refer. 

It is possible to state the required probability immediately. 
Using the notions of logical sum and logical product we may say 
at once that the probability that the second ball is white is 

    $$P\{W\} = \frac{m_1}{M} \cdot \frac{n_1}{N_1} + \frac{m_2}{M} \cdot \frac{n_2}{N_2}.$$



It is required, however, to discuss the possible mathematical 
model which may be set up for the calculation of this probability 
under the given assumption of all balls within one bag being 
equally likely to be drawn. One mathematical model would be 
the enumeration of all possible pairs of balls which might be 
drawn. These will be 

    1W    1B    2W    2B
     a     b     c     d

where $a$, $b$, $c$ and $d$ are the number of these pairs. It follows that 

    $$P\{1\} = \frac{m_1}{M} = \frac{a+b}{a+b+c+d}, \quad P\{2\} = \frac{m_2}{M} = \frac{c+d}{a+b+c+d},$$

    $$P\{W \mid 1\} = \frac{n_1}{N_1} = \frac{a}{a+b}, \quad P\{W \mid 2\} = \frac{n_2}{N_2} = \frac{c}{c+d}.$$

The probability of getting a white ball, given 1 and 2, will therefore 
be 

    $$P\{W \mid 1 \text{ and } 2\} = \frac{a+c}{a+b+c+d},$$

and a solution of the equations for $a$, $b$, $c$ and $d$ will give a complete 
enumeration of the F.P.S. 
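
One way of exhibiting such a F.P.S. explicitly is sketched below in Python. The particular integer solution used for the numbers of pairs 1W, 1B, 2W and 2B (each count weighted by the size of the bag not drawn from, so that the total is M N_1 N_2) is an assumption of this sketch and only one of many proportional solutions; the numerical contents of the bags are hypothetical.

```python
from fractions import Fraction

def fps_counts(m1, m2, n1, N1, n2, N2):
    """One integer solution for the numbers (a, b, c, d) of pairs 1W, 1B, 2W, 2B."""
    a = m1 * n1 * N2            # number 1 from A, then white from B
    b = m1 * (N1 - n1) * N2     # number 1 from A, then black from B
    c = m2 * n2 * N1            # number 2 from A, then white from C
    d = m2 * (N2 - n2) * N1     # number 2 from A, then black from C
    return a, b, c, d

# Hypothetical contents of the three bags.
m1, m2, n1, N1, n2, N2 = 3, 2, 4, 10, 1, 6
M = m1 + m2
a, b, c, d = fps_counts(m1, m2, n1, N1, n2, N2)
total = a + b + c + d

# The F.P.S. reproduces the given probabilities ...
assert Fraction(a + b, total) == Fraction(m1, M)     # P{1}
assert Fraction(a, a + b) == Fraction(n1, N1)        # P{W | 1}
assert Fraction(c, c + d) == Fraction(n2, N2)        # P{W | 2}

# ... and the probability that the second ball is white.
p_white = Fraction(a + c, total)
assert p_white == Fraction(m1, M) * Fraction(n1, N1) + Fraction(m2, M) * Fraction(n2, N2)
print(a, b, c, d, p_white)
```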

Definition of independence. The property $E_1$ is independent of 
the property $E_2$ if 

    $$P\{E_1\} = P\{E_1 \mid E_2\}.$$

THEOREM. If the property $E_1$ is independent of the property 
$E_2$ then the property $E_2$ is independent of the property $E_1$. 
From the definition of independence, if $E_1$ is independent of $E_2$ 
then 

    $$P\{E_1\} = P\{E_1 \mid E_2\}.$$

It is required to show that $E_2$ is independent of $E_1$, i.e. 

    $$P\{E_2\} = P\{E_2 \mid E_1\}.$$

The result follows immediately from a consideration of the 
logical product of $E_1$ and $E_2$. 

    $$P\{E_1 E_2\} = P\{E_1\}\, P\{E_2 \mid E_1\} = P\{E_2\}\, P\{E_1 \mid E_2\}.$$

Using the fact that $E_1$ is independent of $E_2$ the converse follows. 

Example. If $E_1$ and $E_2$ are mutually exclusive, can they be 
independent? The answer to the question follows directly from 
the definitions. $E_1$ and $E_2$ are mutually exclusive if no element 
of the F.P.S. possesses both the properties $E_1$ and $E_2$. That is to 
say 

    $$P\{E_1 \mid E_2\} = 0 = P\{E_2 \mid E_1\}.$$

The condition for independence has just been stated as 

    $$P\{E_1\} = P\{E_1 \mid E_2\}, \quad P\{E_2\} = P\{E_2 \mid E_1\}.$$

Hence $E_1$ and $E_2$ can only be mutually exclusive and independent 
if 

    $$P\{E_1\} = 0 = P\{E_2\},$$

which is absurd. It follows therefore that $E_1$ and $E_2$ cannot be 
both mutually exclusive and independent. 

Example. Consider three properties $E_1$, $E_2$ and $E_3$. Given 
(i) the F.P.S. is finite, (ii) $E_1$ is independent of $E_2$, (iii) $E_1$ is 
independent of $E_3$, (iv) $E_1$ is independent of $E_2 E_3$. 

Prove that $E_1$ is also independent of $E_2 + E_3$. 

Again the solution of the problem follows directly from the 
definitions previously given. Let the F.P.S. be composed of 
$n$ elements, divided into classes according to the properties 
possessed: $n_1$ possess the property $E_1$ alone, $n_2$ the property 
$E_2$ alone, $n_3$ the property $E_3$ alone, $n_{12}$ just the two properties 
$E_1$ and $E_2$, $n_{23}$ just the two properties $E_2$ and $E_3$, $n_{31}$ just the 
two properties $E_3$ and $E_1$, $n_{123}$ all three properties $E_1, E_2, E_3$, 
and $n_0$ possess none of these properties. 

From the given conditions 

    $$P\{E_1\} = P\{E_1 \mid E_2\} = P\{E_1 \mid E_3\} = P\{E_1 \mid E_2 E_3\} = a \text{ (say)}.$$

Substituting for these probabilities we have 

    $$a = \frac{n_1 + n_{12} + n_{31} + n_{123}}{n} = \frac{n_{12} + n_{123}}{n_2 + n_{12} + n_{23} + n_{123}} = \frac{n_{31} + n_{123}}{n_3 + n_{31} + n_{23} + n_{123}} = \frac{n_{123}}{n_{23} + n_{123}}.$$

A solution of these equations and a simple rearrangement (adding 
the second and third equations and subtracting the fourth) shows 
that 

    $$a = \frac{n_{12} + n_{31} + n_{123}}{n_2 + n_3 + n_{12} + n_{23} + n_{31} + n_{123}} = P\{E_1 \mid E_2 + E_3\},$$

and since $a = P\{E_1\}$ 
the result follows. It will be noticed that no mention was made 
of independence (or otherwise) between $E_2$ and $E_3$ and the result 
will hold therefore whether these two properties are independent 
or not. 
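
A numerical instance may make the result more concrete. In the Python sketch below the F.P.S. of 20 elements is hypothetical: the number of elements possessing each exact combination of properties has been chosen so that conditions (ii) to (iv) hold with a common value of 1/2, while the second and third properties are themselves not independent.

```python
from fractions import Fraction

# Hypothetical F.P.S.: number of elements possessing exactly the properties
# whose indices are listed in the key (the empty set meaning 'none').
counts = {
    frozenset(): 4,
    frozenset({1}): 4, frozenset({2}): 2, frozenset({3}): 3,
    frozenset({1, 2}): 2, frozenset({2, 3}): 1, frozenset({1, 3}): 3,
    frozenset({1, 2, 3}): 1,
}

def prob(prop, given=lambda s: True):
    """P{prop | given}, by counting within the elements possessing `given`."""
    reference = {s: c for s, c in counts.items() if given(s)}
    favourable = sum(c for s, c in reference.items() if prop(s))
    return Fraction(favourable, sum(reference.values()))

def e1(s): return 1 in s
def e2(s): return 2 in s
def e3(s): return 3 in s
def e2_and_e3(s): return 2 in s and 3 in s    # logical product E_2 E_3
def e2_or_e3(s): return 2 in s or 3 in s      # logical sum E_2 + E_3

# Conditions (ii), (iii) and (iv) of the example.
assert prob(e1) == prob(e1, e2) == prob(e1, e3) == prob(e1, e2_and_e3)

# Conclusion: E_1 is independent of E_2 + E_3 ...
assert prob(e1, e2_or_e3) == prob(e1)

# ... although here E_2 and E_3 are not independent of each other.
assert prob(e2, e3) != prob(e2)
```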

These then are the preliminary definitions and theorems which 
are necessary for the development of probability theory. All 
will be used in the exposition which follows although we shall not 
necessarily restate each theorem or definition at the time at 
which it is used. It should be the aim of the reader so to familiarize 
himself with the concepts that reference back to this chapter 
becomes unnecessary. Further, each and every stage of a 
calculation of any probability, no matter how trivial, should be 
followed by an interpretation of what the probability means, or 
would mean if experiments were carried out. Only by such 
repetition can the theory of probability acquire for the reader 
both meaning and sense. 

REFERENCES AND READING 

There would appear to be little argument possible about the theorems 
outlined in this chapter. The reader will notice that all proofs are based 
on the assumption that the F.P.S. is finite. The propositions can be 
shown also to hold for an infinite F.P.S. but appeal would need to be made 
to the theory of sets in order to justify the proofs. The reader must 
perforce accept the propositions as true for all cases. 

If further examples are required they may be found in many algebra 
text-books or in the chapter headed 'Probability' in W. A. Whitworth, 
Choice and Chance, whose examples and illustrations are exhaustive. 

It will be useful for the reader of these books to add to both the 
question and answer the words necessary before a probability can be 
calculated or interpreted. For example, the favourite problem of the 
probability of drawing balls from an urn is incalculable unless it is 
assumed that all balls have an equal probability of being drawn, and 
so on. 



CHAPTER III 

THE BINOMIAL THEOREM IN 
PROBABILITY 

Following the definitions and theorems for the addition and 
multiplication of probabilities which have been set out in the 
previous chapter it would be possible to solve any problem in 
elementary probability; for it is not possible to conceive a problem 
which could not be solved ultimately by its reduction to first 
principles. Nevertheless, from the basic raw material of our 
subject as represented by these first principles, it is possible to 
fashion tools which add not only to our appreciation of its 
applications but which also seem greatly to extend our knowledge. 
Intrinsically the application of the binomial theorem to prob- 
ability described below is just a rapid method for the calculation 
of probabilities by the joint application of the elements of 
probability theory and combinatorial analysis. Nevertheless, 
because of its utility in modern probability and statistical theory, 
it will be advantageous to consider the use of the theorem in 
some detail. 

The problem in which the binomial theorem is most frequently 
employed is sometimes referred to as the problem of repeated 
trials. This arises from the fact that it generally presupposes 
a series of repeated trials in each of which the probability of an 
event occurring is constant; it is required to state the probability 
of a given number of successes in a total of repeated trials. In 
order to prove the result it is not necessary to assume that this 
probability, constant from trial to trial, is also a known prob- 
ability. We shall, however, begin by an illustration in which the 
probability of a single event occurring is assumed as known. 

Let us consider the case of the tossing of a halfpenny when it 
is known that the constant probability of a head is 1/2 in a single 
trial. If the halfpenny is tossed twice then, as we have seen 
earlier, an appropriate mathematical model will be 

    $$H_1H_2, \quad H_1T_2, \quad T_1H_2, \quad T_1T_2,$$

with the probabilities of 1/4, 1/2, 1/4 of obtaining two heads, one head 
and one tail, or two tails respectively with two spins of the coin. 
If the halfpenny is tossed three times in succession the appropriate 
alternatives will be 

    $$H_1H_2H_3, \quad H_1H_2T_3, \quad H_1T_2H_3, \quad T_1H_2H_3,$$
    $$H_1T_2T_3, \quad T_1H_2T_3, \quad T_1T_2H_3, \quad T_1T_2T_3.$$

The three tosses can give the possible results of three heads, two 
heads and one tail, one head and two tails or three tails with 
corresponding probabilities 1/8, 3/8, 3/8, 1/8; but already the enumeration 
of the possible alternatives is becoming cumbersome and leaves 
the way open for errors. It is clear that for cases in which ten or 
more tosses were made, the appeal to first principles, although 
still a possibility, would need considerable calculation, whereas, 
as will be shown, the required probabilities may be obtained 
immediately by an application of the binomial theorem. 
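
The appeal to first principles can be handed to a machine, which makes plain how quickly the enumeration grows. The Python sketch below lists every sequence of heads and tails for n tosses (2^n elements, all assumed of equal weight) and counts the heads in each; the function name is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def heads_distribution(n):
    """Probabilities of 0, 1, ..., n heads in n tosses, by full enumeration."""
    fps = product("HT", repeat=n)                       # 2**n sequences
    tally = Counter(seq.count("H") for seq in fps)      # heads per sequence
    return [Fraction(tally[k], 2 ** n) for k in range(n + 1)]

print(heads_distribution(3))
print(len(heads_distribution(10)), "terms from", 2 ** 10, "sequences")
```

For ten tosses the enumeration already runs to 1024 sequences, and each further toss doubles the number.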

THEOREM. If the probability of the success of the event E in 
a single trial is constant and equal to p, then the probabilities of 
$k$ successes (for $k = 0, 1, 2, \ldots, n$), in $n$ independent trials are given 
by the successive terms of the expansion of 

    $$(q + p)^n,$$

where $q = 1 - p$. 

Let $P_{n,k}$ denote the probability that an event, the constant 
probability of the occurrence of which in a single trial is $p$, will 
happen exactly $k$ times in $n$ trials. In order to prove the 
theorem, therefore, it is necessary to prove the identity 

    $$(q + px)^n = P_{n,0} + P_{n,1}x + P_{n,2}x^2 + \cdots + P_{n,n}x^n.$$

There are many ways in which this may be done but possibly one 
of the simplest is by induction, dividing the proof into three parts. 
(1) The identity is true for $n = 1$, i.e. 

    $$q + px = P_{1,0} + P_{1,1}x.$$

This follows directly from definition. $P_{1,0}$ is the probability that 
in a single trial the event will not happen, that is $P_{1,0}$ equals $q$. 
Similarly $P_{1,1}$ is the probability that in a single trial the event 
will happen. Hence $P_{1,1}$ must be equal to $p$. 

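(2) Assume that the identity is true for $m$ and prove that if it is 
true for $m$ it is true for $m + 1$. The assumption is therefore that 

    $$(q + px)^m = P_{m,0} + P_{m,1}x + \cdots + P_{m,m}x^m.$$

Multiply each side by $(q + px)$ and collect the coefficients. 

    $$(q + px)^{m+1} = qP_{m,0} + (qP_{m,1} + pP_{m,0})x + \cdots + (qP_{m,k} + pP_{m,k-1})x^k + \cdots + pP_{m,m}x^{m+1}.$$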



Consider the different coefficients separately. 

$qP_{m,0}$ = the probability that an event will not happen at 
all in $m$ trials multiplied by the probability that it 
will not happen in a further single trial. 

Hence 

    $$qP_{m,0} = P_{m+1,0}.$$

Similarly it may be argued that 

    $$pP_{m,m} = P_{m+1,m+1}.$$

It will be sufficient for the other coefficients to consider a typical 
term, say the coefficient of $x^k$. 

$qP_{m,k} + pP_{m,k-1}$ = the probability that an event will 
happen exactly $k$ times in $m$ trials 
multiplied by the probability that it 
will not happen in one further trial, 
plus the probability that it will happen 
$k - 1$ times in the first $m$ trials multiplied 
by the probability that it will 
happen in one further trial. 

It is clear therefore that 

    $$qP_{m,k} + pP_{m,k-1} = P_{m+1,k}.$$



(3) It has been shown that if the identity is true for m it is also 
true for m+1. It has also been shown that the identity is true 
for n equal to unity. Hence if it is true for n equal to one it is true 
for $n$ equal to two and so on universally. Writing $x$ equal to unity 
we have 

    $$(q + p)^n = P_{n,0} + P_{n,1} + \cdots + P_{n,n},$$

and the theorem is proved. 

The function $(q + px)^n$ is sometimes called the generating 
function of the probabilities $P_{n,k}$. 
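
The inductive step of the proof is itself a method of calculation: each multiplication by (q + px) turns the coefficients for m trials into those for m + 1 trials. The Python sketch below builds the coefficients in this way and compares them with the terms of the binomial expansion written out directly; the function names are illustrative, and n is assumed to be at least 1 with p supplied as an exact fraction.

```python
from fractions import Fraction
from math import comb

def binomial_by_recursion(n, p):
    """Coefficients of (q + px)^n, built one trial at a time (n >= 1)."""
    q = 1 - p
    coeffs = [q, p]                                # (q + px)^1
    for _ in range(n - 1):
        nxt = [q * coeffs[0]]                      # no success in m + 1 trials
        nxt += [q * coeffs[k] + p * coeffs[k - 1]  # exactly k successes
                for k in range(1, len(coeffs))]
        nxt.append(p * coeffs[-1])                 # success at every trial
        coeffs = nxt
    return coeffs

def binomial_closed_form(n, p):
    """Successive terms n!/(k!(n-k)!) p^k q^(n-k) of the expansion of (q + p)^n."""
    q = 1 - p
    return [comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]

p = Fraction(1, 6)
assert binomial_by_recursion(8, p) == binomial_closed_form(8, p)
assert sum(binomial_by_recursion(8, p)) == 1       # writing x equal to unity
```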

Example. If the constant probability of obtaining a head 
with a single throw of a halfpenny is 1/2, what is the probability 
that in twelve tosses of the coin there will be nine heads? 

The answer will be the coefficient of $x^9$ in the expansion of 
$(\tfrac{1}{2} + \tfrac{1}{2}x)^{12}$, that is it will be 

    $$\frac{12!}{9!\,3!} \left(\frac{1}{2}\right)^{12} = \frac{55}{1024}.$$

This result may be compared with that of Example (5) of 
Chapter II. (Page 16.) 

Example. If the constant probability of obtaining a two 
uppermost with a single throw of a die is 1/6, what is the 
probability of obtaining 3 twos in 8 throws of the die? 

Answer 

    $$\frac{8!}{3!\,5!} \left(\frac{1}{6}\right)^{3} \left(\frac{5}{6}\right)^{5}.$$



Example. (The problem of points.) Jones and Brown are 
playing a set of games. Jones requires s games to win and Brown 
requires $t$. If the chance of Jones winning a single game is $p$, 
find the probability that he wins the set. This problem is a version 
of the famous ' problem of points ' which has engaged the atten- 
tion of many writers on classical probability. The solution follows 
directly from an application of the binomial theorem but its 
artificiality should be recognized. It appears doubtful whether 
skill at games can ever be expressed in terms of chances of winning 
and, further, it would seem that when chances of winning are 
spoken of with regard to a set of games something in the nature 
of a subjective probability is meant and not purely the objective 
probability of number. However in spite of its artificiality the 
problem is not without interest. 

If Jones' chance of winning a single game is $p$, then Brown's 
must be $q = 1 - p$, because either Jones or Brown must win the 
single game. Suppose Jones takes $s + r$ games to win the set. In 
order to do this he must win the last game and $s - 1$ of the 
preceding $s + r - 1$ games. The probability that Jones will win 
$s - 1$ out of $s + r - 1$ games, when the constant probability that 
he wins a single game is $p$, may be written down directly from the 
binomial theorem. It will be 

    $$\frac{(s+r-1)!}{(s-1)!\,r!}\, p^{s-1} q^{r}.$$

It follows then that the probability that Jones wins the set in 
$s + r$ games will be the probability that he wins the last game 
multiplied by the probability that he wins $s - 1$ out of the other 
$s + r - 1$ games. This is 

    $$\frac{(s+r-1)!}{(s-1)!\,r!}\, p^{s} q^{r}.$$

Now Jones may win in exactly $s$ games, or $s + 1$ games, or $s + 2$ 
games, ... or $s + t - 1$ games. The probability that he wins the set 
is therefore obtained by letting $r$ take values $0, 1, 2, \ldots, (t - 1)$ 
successively and summing, i.e. Jones' chance is 

    $$\sum_{r=0}^{t-1} \frac{(s+r-1)!}{(s-1)!\,r!}\, p^{s} q^{r}.$$



An interesting algebraic identity may be obtained by considering 
Brown's chance of winning and its relation to Jones' chance. 
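
The chance of winning the set may be evaluated directly by summing over r from 0 to t - 1 the probability that the winner takes s + r games. The Python sketch below does this; the function name `chance_to_win` is illustrative. The check that the two players' chances add to unity is presumably the identity the text has in mind, though that reading is an assumption of this sketch.

```python
from fractions import Fraction
from math import comb

def chance_to_win(s, t, p):
    """Chance that a player needing s games beats one needing t games,
    when the first player wins any single game with probability p."""
    q = 1 - p
    # r = number of games lost before the set is won; r = 0, 1, ..., t - 1.
    # comb(s + r - 1, r) equals (s + r - 1)! / ((s - 1)! r!).
    return sum(comb(s + r - 1, r) * p ** s * q ** r for r in range(t))

p = Fraction(1, 2)
jones = chance_to_win(2, 3, p)          # Jones requires 2 games, Brown 3
brown = chance_to_win(3, 2, 1 - p)      # the same formula with the roles exchanged
print(jones, brown)
assert jones + brown == 1               # one or other must win the set
```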

Thus far we have treated binomial applications in which the 
constant probability of a single event is known. It is, however, 
not usual in statistical problems that this probability should be 
known and in the majority of cases it is necessary to estimate it 
from the data provided. We shall return to this point at length 
at a later stage but an example here will serve as further illustra- 
tion of the application of the binomial theorem. 

Example. The report of the dean of a Cambridge college 
showed the following figures: 



    Subject         Number of students   Number of honour   Number of
                    examined             grades             failures

    Mathematics     466                  162                38
    Music           22                   11                 0
    All subjects    -                    38%                5.4%



What is the probability that 

(1) in selecting 466 students at random one would obtain as 
few honour grades as were obtained in mathematics, and as 
many failures? 

(2) in selecting 22 students at random one would obtain no 
failures, as in music, and 11 or more honour grades? 

The percentage of students obtaining honour grades in all 
subjects is 38. Further, the problem states that 466 students are 
to be selected at random. It would seem therefore that the 
appropriate value for the probability of obtaining an honour 
grade is the proportion of honours students within the population, 
i.e. 0.38. Similarly the probability for failure may be taken 
as 0.054. The answers will be, from direct application of the 
theorem, 

    (1) $$\frac{466!}{162!\,304!} (0.38)^{162} (0.62)^{304} \quad \text{and} \quad \frac{466!}{38!\,428!} (0.054)^{38} (0.946)^{428},$$

    (2) $$\frac{22!}{0!\,22!} (0.054)^{0} (0.946)^{22}.$$

The '11 or more' honour grades requires us to calculate the 
probabilities of obtaining 11, 12, ..., 22 honour grades. This will 
be 

    $$\frac{22!}{11!\,11!} (0.38)^{11} (0.62)^{11} + \frac{22!}{12!\,10!} (0.38)^{12} (0.62)^{10} + \frac{22!}{13!\,9!} (0.38)^{13} (0.62)^{9} + \cdots + (0.38)^{22}.$$
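
Expressions of this size are tedious by hand but are readily evaluated by machine. The Python sketch below computes single binomial terms through logarithms of factorials (`math.lgamma`), a device adopted here to keep the very large and very small factors within range; the function name `binom_pmf` is illustrative, and p is assumed to lie strictly between 0 and 1. The probabilities 0.38 and 0.054 are those adopted in the example.

```python
from math import exp, lgamma, log

def binom_pmf(n, k, p):
    """Probability of exactly k successes in n trials, 0 < p < 1."""
    # log of n!/(k!(n-k)!), using lgamma(m + 1) = log(m!)
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))

# (1) Mathematics: exactly 162 honour grades, and exactly 38 failures, in 466.
p_honours_maths = binom_pmf(466, 162, 0.38)
p_failures_maths = binom_pmf(466, 38, 0.054)

# (2) Music: no failures in 22, and 11 or more honour grades in 22.
p_no_failures_music = binom_pmf(22, 0, 0.054)
p_11_or_more = sum(binom_pmf(22, k, 0.38) for k in range(11, 23))

print(p_honours_maths, p_failures_maths, p_no_failures_music, p_11_or_more)
```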

In the foregoing example we have treated a problem in which 
a sample is drawn randomly from a population and, in so doing, 
may appear to have diverged a little from the problem of repeated 
trials. If we consider the procedure in detail, however, it will be 
obvious that the problem of sampling from an infinite population, 
or sampling with replacement from a finite population, is iden- 
tical with the problem of repeated trials. In the problem of 
repeated trials we consider an event E which has a constant 
probability of happening in a single trial. If n trials are performed, 
then the binomial theorem enables us to calculate the probability 
that the event will happen exactly k (say) times in these n trials 
without specifying anything about the order of these k successes; 
in fact the k successes are supposed to occur in an entirely random 
way. Now consider the drawing of a sample at random* from 
a population. For illustration we may imagine the population to 
consist of a box containing disks. If a disk is chosen at random 
from the box it will mean that any one disk has the same chance 
of being chosen as any other disk, and provided the disk is 
returned to the box after each drawing, the probability of choosing 
any one disk will be constant from trial to trial. Hence if a disk 
is taken out and returned ten times, we may say that we have 
selected a 'random sample' of the disks, but we might also say 
that we had made repeated trials which were ten in number. In 
the case of the Cambridge students, provided each student in the 
population was given the chance of being chosen more than once 
(i.e. provided the disk is returned to the box), the choosing of 
466 students at random is equivalent to making repeated trials 
466 in number, the probability for choosing an honours student 
being constant from trial to trial. 

* A sample randomly drawn from a population is commonly spoken 
of as a 'random sample'. We shall follow common usage by writing of 
'random samples' but it is necessary to remember that the adjective 
'random' should apply to the method of drawing the sample and not to 
the sample itself. 

Example. A population is composed of equal numbers of red 
and white disks. These disks are identical in all respects except 
for colour. A disk is chosen at random and replaced 8 times. 
What is the probability that this sample of 8 will be made up 
of 0, 1, 2, ..., 8 red disks? 

We may assume that the probability of obtaining a red disk 
at a single trial is 1/2. It is stated that 8 trials are made. The 
required probabilities are therefore given by successive terms of 
the binomial 

    $$\left(\tfrac{1}{2} + \tfrac{1}{2}\right)^{8}.$$

Example. In the 8 offspring from the mating of a hybrid (Aa) 
and a recessive (aa), 7 were observed to be hybrids and one 
recessive. Is this result exceptional? 

Following the genetical hypotheses (see Chapter VIII) it is 
clear that the only offspring from the mating Aa x aa can be 
hybrids (Aa) and recessives (aa) and that these may be expected 
to occur in equal numbers. In other words, the probability of 
obtaining a hybrid is 1/2 and the hybrid and recessive are the only 
possible properties. It follows that the probability of obtaining 
7 or more than 7 hybrid offspring from such a mating will be 



We interpret this probability by stating that, on the average, 
seven times in 100 we should expect to obtain 7 or 8 hybrids in 
a family of 8 and we cannot therefore consider that the reported 
7 hybrids are exceptional in number. 

Let us return now to a study of the binomial probabilities 
generated from $(q + p)^n$. This series of probabilities will only be 
symmetrical for $p = q = 1/2$. For $p$ greater than 1/2 the largest term 
will be towards the right of the distribution, for it is to be expected 
that if the probability of success in a single trial is large then the 
probability of obtaining a high proportion of successes would 
also tend to be large. The position of the largest term, or of the 
two largest terms, may be found by means of two simple 
inequalities. 

It has been shown that 

    $$P_{n,k} = \frac{n!}{k!\,(n-k)!}\, p^{k} q^{n-k}.$$

Let the largest term be when $k = k_0$. It will follow that 

    $$P_{n,k_0-1} \le P_{n,k_0} > P_{n,k_0+1}.$$

Consider first the left-hand inequality. Substituting for $P_{n,k_0}$ 
and $P_{n,k_0-1}$ we have 

    $$\frac{n!}{(k_0-1)!\,(n-k_0+1)!}\, p^{k_0-1} q^{n-k_0+1} \le \frac{n!}{k_0!\,(n-k_0)!}\, p^{k_0} q^{n-k_0},$$

from which it follows that 

    $$k_0 \le (n+1)p.$$

Similarly it may be shown from the right-hand inequality that 

    $$k_0 > (n+1)p - 1,$$

so that 

    $$(n+1)p \ge k_0 > (n+1)p - 1.$$

The largest term of the binomial series may therefore be found 
quickly for it is the term corresponding to the integer which 
satisfies this inequality. 

Example. Find the largest term in the expansion of $(\tfrac{1}{2} + \tfrac{1}{2})^{8}$. 
Here $p = 1/2$ and $n = 8$ and we have 

    $$4.5 \ge k_0 > 3.5.$$

This means that the largest term of the expansion will be when 
$k_0 = 4$. It may be pointed out, however, that if the largest term 
is when $k_0 = 4$ this will imply that the fifth term of the series is 
the greatest, for the number of successes may be 0, 1, 2, 3, 4, ..., 8. 
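
The rule for the largest term may be set against a direct search through the terms of the series. The Python sketch below assumes 0 < p < 1 and takes p as an exact fraction; the function names are illustrative. When (n + 1)p is a whole number the two neighbouring terms are equal, which is the case of the 'two largest terms'.

```python
from fractions import Fraction
from math import comb, floor

def largest_terms(n, p):
    """Values of k giving the largest term, from (n+1)p >= k > (n+1)p - 1."""
    bound = (n + 1) * p
    k0 = floor(bound)
    if bound == k0:                 # (n+1)p integral: terms k0 - 1 and k0 are equal
        return [k0 - 1, k0]
    return [k0]

def largest_terms_by_search(n, p):
    """The same result found by computing every term of the series."""
    terms = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    top = max(terms)
    return [k for k, term in enumerate(terms) if term == top]

half = Fraction(1, 2)
print(largest_terms(8, half))       # k counts successes, so term number k + 1
assert largest_terms(8, half) == largest_terms_by_search(8, half)
assert largest_terms(9, half) == largest_terms_by_search(9, half)
```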
Exercise. Find the largest term of the following binomials. 

(1) (Hi) 10 . (HI) 10 , (HI) 10 - 

(2) (Hi) 20 > (HI) 20 , (HI) 20 - 

Calculate the distributions and thus check your results. 

This exercise shows clearly the way in which a change in $p$
and in $n$ alters the shape of the binomial distribution. Generally,
however, we shall not be concerned with the term of greatest
probability, for it is not of much utility in statistical theory. We
shall consider the two other collective characters, the mean and
the standard deviation of the distribution.

A knowledge of the moments of the binomial distribution is
necessary for several reasons. First, by studying these moments
and the derived collective characters $\beta_1$ and $\beta_2$ an idea of the
shape of the distribution can be arrived at just as quickly as
from a study of the most probable term, and possibly more
accurately. Secondly, in cases where it is necessary to approximate
to the binomial by some frequency curve it will be necessary
to know at least the first two moments. Thirdly, in fitting the
binomial series to a set of observations it is usually necessary to
estimate both $n$ and $p$. This may be done most quickly by equating
the mean and standard deviation of the binomial series to those
calculated from the observations. We shall therefore extend the
framework of our theory and derive the moments of the binomial
series.

THEOREM. If the probability of the success of an event E in
a single trial is constant and equal to $p$, then the theoretical
distribution of the probabilities of obtaining 0, 1, 2, ... successes
in $n$ trials will have the following first four moments:

(1) Mean $= \mu'_1 = np$, (2) Variance $= \mu_2 = npq$,
(3) $\mu_3 = npq(q-p)$, (4) $\mu_4 = npq[1 + 3pq(n-2)]$,

where $q+p = 1$.*

The calculation of these theoretical moments may be exactly 
paralleled in the reader's mind by the calculation of the moments 
of any given frequency distribution, such as is worked out early 
on in statistical practice. 

Regarding the probabilities as frequencies we have

$$\mu'_1 = \sum_{k=0}^{n} k\,\frac{n!}{k!\,(n-k)!}\,p^k q^{n-k} = np\sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!\,(n-k)!}\,p^{k-1} q^{n-k},$$

where $k$ may be compared with the distance from the arbitrary
origin and $\frac{n!}{k!\,(n-k)!}\,p^k q^{n-k}$ may be compared with the group
frequency. There is no need here to divide by the sum of the
'frequencies' because in this case it is unity. If the term $np$ is
taken outside the summation sign it will be seen that the terms
inside are simply another binomial series with $n-1$ as index
instead of $n$, and are accordingly equal to unity. Hence

Mean $= \mu'_1 = np$.

* It is assumed that the student will have read enough statistics to
be familiar with the notation for moments. Briefly $\mu'_k$ indicates the $k$th
moment about any arbitrary origin, and $\mu_k$ indicates the $k$th moment
about the mean.

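The variance follows in the same way.

$$\mu'_2 = \sum_{k=0}^{n} k^2\,\frac{n!}{k!\,(n-k)!}\,p^k q^{n-k} = \sum_{k=0}^{n} k(k-1)\,\frac{n!}{k!\,(n-k)!}\,p^k q^{n-k} + \sum_{k=0}^{n} k\,\frac{n!}{k!\,(n-k)!}\,p^k q^{n-k},$$

whence $\mu'_2 = n(n-1)p^2 + np$.

Applying the correction to $\mu'_2$ in order to obtain the variance,
that is the second moment about the mean, we have

$$\mu_2 = \mu'_2 - (\mu'_1)^2 = n(n-1)p^2 + np - n^2p^2 = npq.$$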

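Similarly

$$\mu'_3 = \sum_{k=0}^{n} k^3\,\frac{n!}{k!\,(n-k)!}\,p^k q^{n-k} = n(n-1)(n-2)p^3 + 3n(n-1)p^2 + np$$

and

$$\mu'_4 = \sum_{k=0}^{n} k^4\,\frac{n!}{k!\,(n-k)!}\,p^k q^{n-k} = n(n-1)(n-2)(n-3)p^4 + 6n(n-1)(n-2)p^3 + 7n(n-1)p^2 + np.$$

It will be necessary to convert $\mu'_3$ and $\mu'_4$ from the arbitrary
origin to the mean. These corrections are

$$\mu_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2(\mu'_1)^3, \qquad \mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2(\mu'_1)^2 - 3(\mu'_1)^4,$$

and using these relations easy algebra gives

$$\mu_3 = npq(q-p), \qquad \mu_4 = npq[1 + 3pq(n-2)],$$

which proves the theorem.

As a corollary it is straightforward to show that

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(q-p)^2}{npq} = \frac{1}{npq} - \frac{4}{n}, \qquad \beta_2 = \frac{\mu_4}{\mu_2^2} = 3 + \frac{1-6pq}{npq} = 3 + \frac{1}{npq} - \frac{6}{n}.$$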
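
The theorem and its corollary may be verified numerically by treating the binomial probabilities as frequencies and summing directly. The following Python sketch is offered as a check under stated assumptions, not as part of the proof; the function names and the choice $n = 12$, $p = \tfrac13$ are illustrative only.

```python
from math import comb, isclose


def moments_by_summation(n, p):
    """Mean and the 2nd, 3rd and 4th moments about the mean, by direct summation."""
    q = 1 - p
    probs = [comb(n, k) * p**k * q ** (n - k) for k in range(n + 1)]
    mean = sum(k * w for k, w in enumerate(probs))

    def central(r):
        return sum((k - mean) ** r * w for k, w in enumerate(probs))

    return mean, central(2), central(3), central(4)


def moments_by_theorem(n, p):
    """The closed forms stated in the theorem."""
    q = 1 - p
    return (n * p, n * p * q, n * p * q * (q - p),
            n * p * q * (1 + 3 * p * q * (n - 2)))


def betas(n, p):
    """beta_1 and beta_2 computed from the moments obtained by summation."""
    _, m2, m3, m4 = moments_by_summation(n, p)
    return m3**2 / m2**3, m4 / m2**2


if __name__ == "__main__":
    n, p = 12, 1 / 3  # illustrative values only
    print(moments_by_summation(n, p))
    print(moments_by_theorem(n, p))
    print(betas(n, p))
```

The two sets of moments agree to within rounding error, and the values of $\beta_1$ and $\beta_2$ agree with $1/npq - 4/n$ and $3 + 1/npq - 6/n$.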



Example. In 103 litters of 4 mice the number of litters which
contained 0, 1, 2, 3, 4 females were noted. The figures are given
in the table below:

Number of female mice     0    1    2    3    4   Total
Number of litters         8   32   34   24    5    103

(1) If the chance of obtaining a female in a single trial
is assumed constant, estimate this constant but unknown
probability.

(2) If the size of the litter (4) had not been given, how could it
be estimated from the data?

(3) How could the assumption that the chance of obtaining
a female in a single trial is constant be tested?

(1) The mean of the observations given is equal to

$$\tfrac{1}{103}[8 \cdot 0 + 32 \cdot 1 + 34 \cdot 2 + 24 \cdot 3 + 5 \cdot 4] = 1.864.$$

If it is assumed therefore that the number of litters in each class
divided by 103 is an estimate of the binomial probability in each
class we shall have

$$np = 1.864,$$

whence, since $n$ is given equal to 4, it follows that $p$ is equal to
0.466.

(2) If $n$ is not given but must be estimated from the data it
will be necessary to calculate the variance of the given observations.
This is equal to 1.030. We have, therefore, equating the
theoretical variance of the binomial to that of the observational
data, that

$$npq = 1.030.$$

Dividing by the relationship

$$np = 1.864,$$

we have $q = 0.553$, $p = 0.447$ and $n$ accordingly approximately
equal to 4. Since in this case $n$ must be an integer we should have
no hesitation in estimating the litter size as 4 and readjusting the
probability accordingly.

(3) There are several ways in which the adequacy of the
initial assumptions may be tested. The simplest one perhaps is
to calculate the frequencies as given by the terms of the
theoretical binomial

$$103(0.534 + 0.466)^4$$

and compare with the actual frequencies.

Observed frequency        8   32   34   24    5   103
Theoretical frequency     8   29   38   23    5   103

There is no need for a further test here to find out whether the
theoretical hypothesis as given by the binomial adequately
describes the observational data. The agreement between theory
and observation is good and we may say that there is no reason
why the probability should not be assumed constant.
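
The arithmetic of this example may be retraced with a short Python sketch. It is given only as an aid to checking the figures: the function names are assumed, the variance is taken with divisor 103 as in the text, and the theoretical frequencies are printed unrounded, so that they may differ by a unit from figures rounded by hand to make up the total.

```python
from math import comb

# Observed number of litters containing k females, k = 0..4 (from the table).
litters = {0: 8, 1: 32, 2: 34, 3: 24, 4: 5}


def sample_mean_variance(freq):
    """Total, mean and variance (divisor: total frequency) of a frequency table."""
    total = sum(freq.values())
    mean = sum(k * f for k, f in freq.items()) / total
    variance = sum(k * k * f for k, f in freq.items()) / total - mean**2
    return total, mean, variance


def fit_with_known_n(freq, n):
    """Part (1): equate np to the mean; also return the theoretical frequencies."""
    total, mean, _ = sample_mean_variance(freq)
    p = mean / n
    expected = [total * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    return p, expected


def fit_with_unknown_n(freq):
    """Part (2): equate np and npq to the observed mean and variance."""
    _, mean, variance = sample_mean_variance(freq)
    q = variance / mean
    p = 1 - q
    return mean / p, p


if __name__ == "__main__":
    total, mean, variance = sample_mean_variance(litters)
    p_hat, expected = fit_with_known_n(litters, 4)
    n_hat, p_moment = fit_with_unknown_n(litters)
    print(round(mean, 3), round(variance, 3), round(p_hat, 3))
    print(round(n_hat, 2), round(p_moment, 3))
    print([round(e, 2) for e in expected])
```

The estimate of $n$ obtained in this way is a little over 4, which supports the choice of the integer 4 made in the text.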

Exercise. A cast of 12 dice was made 26,306 times and the
frequency of dice with 5 or 6 points uppermost was recorded.
W. F. R. Weldon found the following distribution:

Number of dice with
5 or 6 points          0     1     2     3     4     5     6     7    8    9   10   11   12
Observed frequency   185  1149  3265  5475  6114  5194  3067  1331  403  105   14    4    0

Check whether the mathematical model whereby all sides of
a die may be assumed equi-probable is suitable for this set of
observations and calculate the theoretical frequencies appropriate
to the binomial hypothesis.

Exercise. A wooden target is divided into 1000 squares. Shots
are fired at the target, the aiming being supposedly random
within the area of the target. The distribution of the number of
shots in any one square is the following:

Number of shots (k) within a square: 0 1 2 3 4 5 6 7 8 9 10 11
Number of squares with k shots: 1 4 10 89 190 212 204 193 79 16 2

How could the hypothesis that the aiming was random within
the target area be tested from these figures?

CHAPTER IV

EVALUATION OF BINOMIAL
PROBABILITIES

It will not have escaped the notice of the reader that the 
evaluation of binomial probabilities if n, the number of trials, 
be large, will lead to a certain amount of somewhat tedious 
arithmetic. The evaluation of a single value of $P_{n,k}$ may be carried
out fairly quickly, provided tables of log-factorials are available, 
but it is rare that a single probability is required. More often 
than not in statistical practice we are concerned with finding 
the probability of obtaining a number greater than or less than 
a given number and it becomes necessary to evaluate the sum of 
a number of binomial probabilities. In this case the calculations 
are frequently lengthy and allow much scope for error. It is 
worth while therefore to study such methods as there are for the 
evaluation of such a sum, and to consider what approximations 
to the binomial series have been made. We shall do this in several 
stages. 

1. RELATION BETWEEN THE BINOMIAL SERIES AND 
THE INCOMPLETE B-FUNCTION RATIO 

The complete B-function may be defined as

$$B(s,r) = \int_0^1 x^{s-1}(1-x)^{r-1}\,dx$$

and the incomplete B-function as

$$B_t(s,r) = \int_0^t x^{s-1}(1-x)^{r-1}\,dx \quad \text{for } 0<t<1.$$

Provided $s$ and $r$ are integers the complete B-function may also
be expressed in terms of complete $\Gamma$-functions and thence in the
ratio of factorials, viz.:

$$B(s,r) = \frac{\Gamma(s)\,\Gamma(r)}{\Gamma(s+r)} = \frac{(s-1)!\,(r-1)!}{(s+r-1)!}.$$

The incomplete B-function ratio is the ratio of the incomplete
B-function to the complete B-function. In statistical practice
it is commonly written

$$I_t(s,r) = \frac{B_t(s,r)}{B(s,r)} = \int_0^t x^{s-1}(1-x)^{r-1}\,dx \Big/ \int_0^1 x^{s-1}(1-x)^{r-1}\,dx.$$

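We begin by considering the incomplete B-function ratio,
$I_p(k, n-k+1)$, where $n$, $k$ and $p$ have their usual meanings.
From the above definition

$$I_p(k, n-k+1) = \frac{n!}{(k-1)!\,(n-k)!}\int_0^p x^{k-1}(1-x)^{n-k}\,dx.$$

The function may easily be evaluated by integrating by parts:

$$\int_0^p x^{k-1}(1-x)^{n-k}\,dx = \frac{1}{k}\,p^k(1-p)^{n-k} + \frac{n-k}{k}\int_0^p x^{k}(1-x)^{n-k-1}\,dx.$$

Writing $q = 1-p$ it follows that

$$I_p(k, n-k+1) = \frac{n!}{k!\,(n-k)!}\,p^k q^{n-k} + I_p(k+1, n-k) = P_{n,k} + P_{n,k+1} + \dots + P_{n,n},$$

which is equivalent to saying that

$$\sum_{j=k}^{n} P_{n,j} = I_p(k, n-k+1).$$

It will be recognized therefore that, provided the incomplete
B-function ratio is tabled, the sum of any number of binomial
terms may be obtained. For example, if the sum of a number of
terms from $k_1$ to $k_2$ is required, then this is the difference of two
incomplete B-function ratios, and so on.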
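
In the absence of the tables, the relation between the binomial sum and the incomplete B-function ratio may be illustrated numerically. The Python sketch below is an assumption-laden illustration and not the method by which the tables were computed: the ratio is obtained by Simpson's rule for integer $s$ and $r$, the function names are invented for the purpose, and the number of integration steps is an arbitrary choice.

```python
from math import comb, factorial


def incomplete_beta_ratio(t, s, r, steps=2000):
    """I_t(s, r) for positive integers s and r, by Simpson's rule (steps even)."""
    def f(x):
        return x ** (s - 1) * (1 - x) ** (r - 1)

    h = t / steps
    total = f(0.0) + f(t)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * f(j * h)
    incomplete = total * h / 3
    complete = factorial(s - 1) * factorial(r - 1) / factorial(s + r - 1)
    return incomplete / complete


def binomial_tail(n, k, p):
    """Sum of the binomial terms P_{n,j} for j = k, k+1, ..., n."""
    q = 1 - p
    return sum(comb(n, j) * p**j * q ** (n - j) for j in range(k, n + 1))


if __name__ == "__main__":
    n, k, p = 80, 41, 0.4
    print(binomial_tail(n, k, p))
    print(incomplete_beta_ratio(p, k, n - k + 1))
```

The two printed values should agree closely, both being the probability of 41 or more successes in 80 trials with $p = 0.4$.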

Tables of the incomplete B-function ratio were prepared in the
Biometric Laboratory, University College, London, and edited
by Karl Pearson. The function $I_t(s, r)$ is tabled for $r$ over the range
0 to 50, and the values of $s$ for each particular value of $r$ extend
from $s = r$ to $s = 50$. The argument of $t$ is in hundredths. Thus in
the table entries it is not possible immediately to extract the
incomplete B-function ratio if $r > s$. This omission was, however,
deliberate in order to cut down the length of the table, since
a simple mathematical transformation is all that is required in
such cases. We have noted that the incomplete B-function ratio
is defined as

$$I_t(s,r) = \int_0^t x^{s-1}(1-x)^{r-1}\,dx \Big/ \int_0^1 x^{s-1}(1-x)^{r-1}\,dx.$$

Write $1 - x = y$ and we have

$$I_t(s,r) = \int_{1-t}^{1} y^{r-1}(1-y)^{s-1}\,dy \Big/ \int_0^1 y^{r-1}(1-y)^{s-1}\,dy,$$

that is $I_t(s, r) = 1 - I_{1-t}(r, s)$.

Hence the incomplete B-function ratio can be determined for any
value of $r$ and $s$ between 0 and 50. This would suggest that the
largest value for $n$ would also be 50, and this would be so for
a complete enumeration of the binomial series. If, however, the
sum of terms is required for some $k > np$ it will be seen that it is
possible for $n$ to be greater than 50 and that the extreme case
which may be evaluated will be for $k = 50$ and $n = 99$.

In such tables we have therefore a weapon of great utility 
by which much laborious calculation may be avoided, provided 
that the n and k of the binomial series are sufficiently small to 
fall within the range of the tables. The answer obtained from 
these tables is exact. For values of n falling outside the range of 
the tables it will be necessary to use an approximation. Such 
approximations which do not involve more calculation than the 
working out of the binomial terms themselves are not valid for 
n small but may be used freely for n greater than 50. 

2. APPROXIMATION TO THE BINOMIAL SERIES 
USING A HYPERGEOMETRIC SERIES 

It has been pointed out by Uspensky* that an approximation
to the binomial series, using properties of the hypergeometric
series, was put forward by Markoff and that this approximation
has not received the recognition which is undoubtedly its due.
This is, possibly, owing to the fact that the normal curve is well
tabulated and the effort involved in approximating to a sum of
binomial probabilities by part of the area of a normal curve (see
next chapter) is very much less than that involved in Markoff's
approximation. Nevertheless, the hypergeometric approximation
needs fewer mathematical assumptions than does the
normal approximation and for this reason we shall deal with it
in some detail.

* The analysis of this section follows directly along the same lines
developed independently by J. V. Uspensky and J. Müller. (See
references at the end of the chapter.) My proof follows closely that given
by Uspensky, although I have inserted more detail, because I feel that
in general his proof cannot be improved upon.

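It is required to evaluate $P\{k > k_1\}$, that is, it is required to
evaluate

$$\sum_{k=k_1+1}^{n} P_{n,k} = \sum_{k=k_1+1}^{n} \frac{n!}{k!\,(n-k)!}\,p^k q^{n-k}.$$

The assumption is made that $k_1 \ge np$. This assumption is not,
however, restricting, for if $k_1 < np$ then $\sum_{k=0}^{k_1} P_{n,k}$ could be evaluated
by this same method and subtracted from unity to give the
required probability. The first term of the required sum may be
put outside a bracket as follows:

$$\sum_{k=k_1+1}^{n} P_{n,k} = \frac{n!}{(k_1+1)!\,(n-k_1-1)!}\,p^{k_1+1}q^{n-k_1-1}\left[1 + \frac{n-k_1-1}{k_1+2}\,\frac{p}{q} + \frac{(n-k_1-1)(n-k_1-2)}{(k_1+2)(k_1+3)}\,\frac{p^2}{q^2} + \dots\right]$$

and this outside term may be evaluated quickly with the aid of
tables of log-factorials and logarithms. It remains therefore to
sum the finite series within the brackets which will be recognized
as a particular case of the hypergeometric series

$$F(\alpha, \beta, \gamma, z) = 1 + \frac{\alpha\beta}{1\cdot\gamma}\,z + \frac{\alpha(\alpha+1)\,\beta(\beta+1)}{1\cdot 2\cdot\gamma(\gamma+1)}\,z^2 + \dots,$$

where

$$\alpha = -n + k_1 + 1, \quad \beta = 1, \quad \gamma = k_1 + 2, \quad z = -p/q.$$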

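If we define

$$X_{2\nu} = F(\alpha+\nu, \beta+\nu, \gamma+2\nu, z), \qquad X_{2\nu+1} = F(\alpha+\nu, \beta+\nu+1, \gamma+2\nu+1, z),$$

easy algebra will verify that

$$X_{2\nu} = X_{2\nu-1} + \frac{(\beta+\nu)(\gamma-\alpha+\nu)}{(\gamma+2\nu-1)(\gamma+2\nu)}\,z\,X_{2\nu+1}$$

and

$$X_{2\nu+1} = X_{2\nu} + \frac{(\alpha+\nu)(\gamma-\beta+\nu)}{(\gamma+2\nu)(\gamma+2\nu+1)}\,z\,X_{2\nu+2}.$$

Writing $a_{2\nu}$ and $a_{2\nu+1}$ for the coefficients of $zX_{2\nu+1}$ and
$zX_{2\nu+2}$ respectively, we shall have generally

$$X_{\nu} = X_{\nu-1} + a_{\nu}\,z\,X_{\nu+1}.$$

Hence, if we give $\nu$ values 1, 2, ... successively, we obtain a series
of relationships between the X's the first of which will be

$$X_1 = X_0 + a_1 z X_2$$

or

$$\frac{X_1}{X_0} = \frac{1}{1 - a_1 z\,\dfrac{X_2}{X_1}}.$$

By utilizing the successive relationships we may write this down
as a continued fraction

$$\frac{X_1}{X_0} = \cfrac{1}{1 - \cfrac{a_1 z}{1 - \cfrac{a_2 z}{1 - \cfrac{a_3 z}{1 - \dots}}}}.$$

Let $X_0 = 1$ and

$$X_1 = F(\alpha, \beta+1, \gamma+1, z) = F(-n+k_1+1,\ 1,\ k_1+2,\ -p/q)$$

so that

$$\alpha = -n + k_1 + 1, \quad \beta = 0, \quad \gamma = k_1 + 1, \quad z = -p/q.$$

Writing $d_\omega = -z a_{2\omega}$, $c_\omega = z a_{2\omega-1}$,
it may be shown by substitution that

$$d_\omega = \frac{\omega(n+\omega)}{(k_1+2\omega)(k_1+2\omega+1)}\,\frac{p}{q}, \qquad c_\omega = \frac{(n-k_1-\omega)(k_1+\omega)}{(k_1+2\omega-1)(k_1+2\omega)}\,\frac{p}{q}.$$

Hence if $S$ is the sum of the hypergeometric series which we are
proceeding to evaluate, i.e. if

$$S = 1 + \frac{n-k_1-1}{k_1+2}\,\frac{p}{q} + \frac{(n-k_1-1)(n-k_1-2)}{(k_1+2)(k_1+3)}\,\frac{p^2}{q^2} + \dots,$$

then

$$S = \cfrac{1}{1 - \cfrac{c_1}{1 + \cfrac{d_1}{1 - \cfrac{c_2}{1 + \cfrac{d_2}{1 - \dots}}}}},$$

the fraction terminating with the quotient $c_{n-k_1-1}/(1 + d_{n-k_1-1})$.

Referring back to our definition of $c_\omega$ it will be seen that if $c_1$ is
positive and less than unity so will be all the other c's. $c_1$ was
defined as

$$c_1 = \frac{n-k_1-1}{k_1+2}\,\frac{p}{q}.$$

$k_1$ is essentially positive and less than $n$, $p$ and $q$ are positive
fractions and $n$ is a positive integer. It follows therefore that $c_1$
is positive or at least zero. For $c_1$ to be a fraction it is necessary
that

$$(k_1+2)q > (n-k_1-1)p \quad\text{or}\quad k_1 + 2 > np + p.$$

It was assumed that $k_1 \ge np$, so the inequality holds good.

$d_\omega$ is essentially positive and, accordingly, if we consider the
sum of the continued fraction

$$s_i = \cfrac{c_i}{1 + \cfrac{d_i}{1 - \cfrac{c_{i+1}}{1 + \cfrac{d_{i+1}}{1 - \dots}}}}$$

it is clear that $c_i > s_i > 0$ and that

$$s_i = \cfrac{c_i}{1 + \cfrac{d_i}{1 - s_{i+1}}}.$$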

This last expression will give us all that is required for
evaluating $S$. The necessary steps in the calculation will be:

(1) Choose $i$ to be any number desired. Obviously the greater
$i$ the more accurate will be the approximation. Uspensky
suggests taking $i = 5$, but this is only something which may be
learnt by experience and it may be possible to take $i$ smaller than
5 for certain values of $n$. It will not pay to make $i$ too large because
the approximation will not then save a great deal of calculation.

(2) Having chosen $i$, calculate $c_{i+1}$, thus obtaining the upper
limit to the inequality

$$0 < s_{i+1} < c_{i+1}.$$

(3) Calculate $d_i$ and $c_i$ and, using the limits for $s_{i+1}$, obtain
limits for $s_i$ from the relation

$$s_i = \cfrac{c_i}{1 + \cfrac{d_i}{1 - s_{i+1}}}.$$

(4) The calculations (3) are repeated to obtain successively
limits for $s_{i-1}, s_{i-2}, \dots, s_1, S$.

(5) The binomial term outside the bracket is evaluated either
directly by logarithms or by means of an inequality discussed later.

(6) The multiplication of the results of calculations (4) and
(5) gives an approximation to the required sum of binomial
probabilities.
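
The six steps may be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a transcription of Uspensky's or Markoff's working: the function names are invented, the coefficients $c_\omega$ and $d_\omega$ are taken in the forms $c_\omega = \frac{(n-k_1-\omega)(k_1+\omega)}{(k_1+2\omega-1)(k_1+2\omega)}\frac{p}{q}$ and $d_\omega = \frac{\omega(n+\omega)}{(k_1+2\omega)(k_1+2\omega+1)}\frac{p}{q}$, the outside binomial term is evaluated directly instead of by logarithms, and $i$ is reduced where necessary so that the continued fraction is not carried beyond its last term.

```python
from math import comb


def c_coef(w, n, k1, p):
    """c_w = (n-k1-w)(k1+w) / ((k1+2w-1)(k1+2w)) * p/q."""
    return (n - k1 - w) * (k1 + w) / ((k1 + 2 * w - 1) * (k1 + 2 * w)) * p / (1 - p)


def d_coef(w, n, k1, p):
    """d_w = w(n+w) / ((k1+2w)(k1+2w+1)) * p/q."""
    return w * (n + w) / ((k1 + 2 * w) * (k1 + 2 * w + 1)) * p / (1 - p)


def series_limits(n, k1, p, i=5):
    """Lower and upper limits for S, assuming np <= k1 < n."""
    if not (n * p <= k1 < n):
        raise ValueError("the method assumes np <= k1 < n")
    i = min(i, n - k1 - 1)          # the fraction ends at c_{n-k1-1}
    lo, hi = 0.0, c_coef(i + 1, n, k1, p)   # step (2): 0 < s_{i+1} < c_{i+1}
    for w in range(i, 0, -1):       # steps (3) and (4)
        cw, dw = c_coef(w, n, k1, p), d_coef(w, n, k1, p)
        lo, hi = cw / (1 + dw / (1 - hi)), cw / (1 + dw / (1 - lo))
    return 1 / (1 - lo), 1 / (1 - hi)


def tail_limits(n, k1, p, i=5):
    """Limits for P{k > k1}: outside term (step 5) times limits for S (step 6)."""
    first = comb(n, k1 + 1) * p ** (k1 + 1) * (1 - p) ** (n - k1 - 1)
    s_lo, s_hi = series_limits(n, k1, p, i)
    return first * s_lo, first * s_hi


if __name__ == "__main__":
    print(series_limits(80, 40, 0.4))
    print(tail_limits(80, 40, 0.4))
```

Because each $s_\omega$ decreases as $s_{\omega+1}$ increases, the upper limit at one stage is obtained from the lower limit at the stage before, and conversely; this is why the two limits are exchanged inside the loop.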

Example. Find $\sum_{k=k_1+1}^{n} P_{n,k}$ when $n = 80$, $k_1 = 40$, $p = 0.4$. This
may be evaluated directly and exactly from the incomplete
B-function ratio tables. We have

$$P\{k > k_1\} = I_{0.4}(41, 40) = 0.0271236.$$

In practice we should not contemplate using the hypergeometric
approximation if the exact value could be obtained from tables.
However, for the purposes of illustration let us apply the theory
which we have just outlined. First it is necessary to test whether
the only assumption holds. Is $k_1 \ge np$? Here $k_1 = 40$ and $np = 32$
and we may therefore proceed.

(1) Following Uspensky let us choose $i = 5$.

(2) $0 < s_6 < 0.39316$.

(3) and (4) give

 ω      c_ω        d_ω        Limits for s_ω
 5    0.42857    0.11111    0.36224 < s_5 < 0.38571
 4    0.46809    0.09524    0.40526 < s_4 < 0.40727
 3    0.51240    0.07678    0.45364 < s_3 < 0.45381
 2    0.56237    0.05522    0.51073 < s_2 < 0.51075
 1    0.61905    0.02990    0.58340 = s_1 = 0.58340

 S = 2.40038

(5) The binomial term

$$\frac{n!}{(k_1+1)!\,(n-k_1-1)!}\,p^{k_1+1}q^{n-k_1-1} = \frac{80!}{41!\,39!}\,(0.4)^{41}(0.6)^{39} = 0.011300.$$

(6) The required sum of the binomial probabilities is 0.027124.

It will be seen that this agrees well with the value as calculated
from the incomplete B-function ratio tables but the calculation
involved is rather heavy. Even so the calculations are few in
number compared with those which would be necessary in order
to evaluate the binomial series term by term. The hypergeometric
approximation involves no restricting assumptions and may be
made as accurate as desired by increasing the size of $i$ which is
at choice. It should therefore be used for sums of binomial
probabilities outside the range of the B-function ratio tables for
which a definite accuracy is required. We shall discuss later other
approximate methods which give the sums of binomial probabilities
quickly, but these approximations (the normal curve and
Poisson's limit, treated in succeeding chapters) rest on certain
mathematical restrictive assumptions and it is not always
possible to judge the accuracy of the results obtained from their
use. While therefore they may be adequate for the rough determination
of a probability level they are certainly not satisfactory
for calculating a sum of terms when other calculations are to be
based on the result.

Exercise. Given that the constant probability that an event 
will happen in a single trial is 1/3, find the probability that in 
100 trials 45 or more events will happen. 

Exercise. A halfpenny is tossed 200 times. The head fell 
uppermost on 153 occasions and the tail on 47. If it is assumed 
that the constant probability of head or tail in a single trial is 1/2, 
would you consider such an experimental result to be exceptional ? 

Exercise. Consider Weldon's dice experiment of the previous
chapter. Evaluate the probability of obtaining 9 or more dice
with 5 or 6 uppermost when 12 dice are thrown.

3. APPROXIMATE EVALUATION OF A SINGLE
BINOMIAL PROBABILITY

When the $n$ of the binomial series is small it is an easy matter
to evaluate any given binomial term. When $n$ is large it will be
necessary to refer to tables of log-factorials and it is possible that
occasions may arise when these are not readily available or, as
is sometimes the case, the binomial index may lie outside the
range of existing tables. It is rare, however, that the computer
has not access to tables of logarithms, and it is useful therefore
to give inequalities for a single binomial probability when the
binomial index is large. These inequalities will occur again in
a slightly different form in the proof of Laplace's theorem (next
chapter) and the approximation will accordingly serve the further
purpose of making the reader familiar with them. It is required to
evaluate

$$P_{n,k} = \frac{n!}{k!\,(n-k)!}\,p^k q^{n-k}$$

when $n$, $k$, and $n-k$ are all large numbers.
It is known by Stirling's theorem that

$$m! = \sqrt{2\pi}\;m^{m+\frac12}e^{-m}e^{\theta(m)}, \quad\text{where}\quad \frac{1}{12m+6} < \theta(m) < \frac{1}{12m}.$$

Expanding the factorials in the expression for $P_{n,k}$ we have

$$\frac{P_{n,k}}{\left(\dfrac{n}{2\pi k(n-k)}\right)^{\frac12}\left(\dfrac{np}{k}\right)^{k}\left(\dfrac{nq}{n-k}\right)^{n-k}} = \exp[\theta(n) - \theta(k) - \theta(n-k)].$$

From the inequalities for $\theta(m)$ given above, since $n$ is greater
than $k$ or $n-k$ it will follow that

$$\frac{1}{12n+6} - \frac{1}{12k} - \frac{1}{12(n-k)} < \theta(n) - \theta(k) - \theta(n-k) < \frac{1}{12n} - \frac{1}{12k+6} - \frac{1}{12(n-k)+6}$$

and therefore that

$$\exp\left[\frac{1}{12n+6} - \frac{1}{12k} - \frac{1}{12(n-k)}\right] < \exp[\theta(n) - \theta(k) - \theta(n-k)] < \exp\left[\frac{1}{12n} - \frac{1}{12k+6} - \frac{1}{12(n-k)+6}\right].$$

Hence

$$\exp\left[\frac{1}{12n+6} - \frac{1}{12k} - \frac{1}{12(n-k)}\right] < \frac{P_{n,k}}{\left(\dfrac{n}{2\pi k(n-k)}\right)^{\frac12}\left(\dfrac{np}{k}\right)^{k}\left(\dfrac{nq}{n-k}\right)^{n-k}} < \exp\left[\frac{1}{12n} - \frac{1}{12k+6} - \frac{1}{12(n-k)+6}\right],$$

all the terms of which can be evaluated by logarithms.
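Example. Test this approximation for the binomial term

$$\frac{80!}{41!\,39!}\,(0.4)^{41}(0.6)^{39},$$

the exact value of which is 0.011300.

We begin by evaluating the divisor of $P_{n,k}$ by logarithms.

$$\left(\frac{n}{2\pi k(n-k)}\right)^{\frac12}\left(\frac{np}{k}\right)^{k}\left(\frac{nq}{n-k}\right)^{n-k} = 0.011335.$$

The left- and right-hand sides of the inequality may be
determined either from tables of the negative exponential
function, or by logarithms, or in this case perhaps more easily
from the first three terms of the negative exponential series.
Thus

$$\exp\left[\frac{1}{966} - \frac{1}{492} - \frac{1}{468}\right] < \frac{P_{n,k}}{0.011335} < \exp\left[\frac{1}{960} - \frac{1}{498} - \frac{1}{474}\right],$$

which gives $0.011300 \le P_{n,k} \le 0.011300$.

The approximation in this case agrees with the exact value to six
decimal places.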
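
The inequality may be tried on other values with the following Python sketch. It is an illustration only: the function name is assumed, the limits are taken in the form $\exp[\frac{1}{12n+6} - \frac{1}{12k} - \frac{1}{12(n-k)}]$ and $\exp[\frac{1}{12n} - \frac{1}{12k+6} - \frac{1}{12(n-k)+6}]$, which follow from the limits for $\theta(m)$ quoted from Stirling's theorem, and natural logarithms from the standard library stand in for the printed tables.

```python
from math import comb, exp, log, pi


def stirling_limits(n, k, p):
    """Lower limit, divisor and upper limit for P_{n,k}, with 0 < k < n."""
    q = 1 - p
    # Divisor: (n / (2 pi k (n-k)))^(1/2) (np/k)^k (nq/(n-k))^(n-k), via logarithms.
    log_divisor = (0.5 * log(n / (2 * pi * k * (n - k)))
                   + k * log(n * p / k)
                   + (n - k) * log(n * q / (n - k)))
    divisor = exp(log_divisor)
    lower = divisor * exp(1 / (12 * n + 6) - 1 / (12 * k) - 1 / (12 * (n - k)))
    upper = divisor * exp(1 / (12 * n) - 1 / (12 * k + 6) - 1 / (12 * (n - k) + 6))
    return lower, divisor, upper


if __name__ == "__main__":
    n, k, p = 80, 41, 0.4
    exact = comb(n, k) * p**k * (1 - p) ** (n - k)
    lower, divisor, upper = stirling_limits(n, k, p)
    print(f"{lower:.6f} {exact:.6f} {upper:.6f} (divisor {divisor:.6f})")
```

For the term of the example the exact value lies between the two limits, which differ from one another only in the seventh decimal place.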

It will be recognized that, provided tables of the log-factorials
are available, there is no advantage in using this approximation
in preference to calculating the exact value; for the arithmetic
involved in the use of the approximation is, if anything, a little
heavier than for the exact value. If, however, no tables but
ordinary logarithms are available then the approximation must
perforce be used.

Exercise. Calculate the exact values of the binomial probabilities
for

(i) $n = 20$, $k = 11$, $p = 0.4$;
(ii) $n = 200$, $k = 110$, $p = 0.4$

and compare with the approximate method.

the Incomplete B-function to the sum of the first p terms of the binomial
$(a + b)^n$'), but it was almost certainly known before this century.

The hypergeometric approximation has been discussed by J. H.
Müller (Biometrika, XXII, p. 284, 'Application of continued fractions
to the evaluation of certain integrals with special reference to the
Incomplete B-function'). Müller believed his method to be original. It
does not appear, however, to differ greatly from that outlined by
Uspensky and attributed by him to Markoff.



CHAPTER V 

REPLACEMENT OF THE BINOMIAL SERIES 
BY THE NORMAL CURVE 

The fact that it is possible under certain conditions to replace 
a binomial series by a normal curve has been known for many 
years. The first derivation of the curve was given by Abram de 
Moivre (1718) in The Doctrine of Chances and he derived it in 
order to be able to express a certain sum of probabilities. The 
formula was not, however, stated explicitly as a theorem and it 
was not until the advent of Laplace (1812) with his Theorie des 
Probabilites that the relationship between the binomial series 
and the normal curve was given clear mathematical expression. 
Since Laplace it has become increasingly frequent for those 
seeking to give numerical form to the sum of a given number of 
binomial terms to refer to the normal approximation, and the 
tabled areas of the normal curve are used for such evaluations on 
occasions when the restrictions and assumptions of the approxi- 
mation can hardly hope to be justified. We shall try to show at 
a later stage the limitations within which the application of the 
theorem may appear to be legitimate. The theorem may be stated 
in many forms. We shall state it in the following way: 

LAPLACE'S THEOREM 

If $n$ is the number of absolutely independent trials, such that
in each trial the probability of a certain event E is always equal
to $p$, whatever the result of the preceding trials, and if $k$ denotes
the number of trials in which an event E occurs, then whatever
the numbers $z_1$ and $z_2$, where $z_1 < z_2$, the probability

$$P\left\{z_1 \le \frac{k-np}{\sqrt{npq}} \le z_2\right\} \to \frac{1}{\sqrt{2\pi}}\int_{z_1}^{z_2} e^{-\frac12 z^2}\,dz \quad\text{as } n \to \infty.$$

Before proceeding with the direct proof of the theorem it will be 
convenient to begin by stating and proving a lemma which will 
be necessary at different stages of the proof. This lemma appears 
to be due to Duhamel. 



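LEMMA. Consider two sequences of positive sums

$$S_1, S_2, \dots, S_n, \dots \quad\text{and}\quad S'_1, S'_2, \dots, S'_n, \dots$$

such that

$$S_n = \sum_{i=1}^{N_n} P_{n,i}, \qquad S'_n = \sum_{i=1}^{N_n} P'_{n,i}, \qquad N_n \to \infty \text{ as } n \to \infty.$$

If, whatever $\epsilon > 0$, it is possible to find $n_\epsilon$, such that

$$\left|\frac{P'_{n,i}}{P_{n,i}} - 1\right| < \epsilon \quad\text{for } i = 1, 2, \dots, N_n \text{ and } n > n_\epsilon,$$

then the existence of a finite limit

$$\lim_{n \to \infty} S_n = S$$

implies that of $S'_n$, namely

$$\lim_{n \to \infty} S'_n = \lim_{n \to \infty} S_n = S$$

and conversely.

Proof of lemma. If $S_n$ tends to a finite limit $S$ then there exist
fixed numbers $M$ and $n_M$ such that for any $n > n_M$, $S_n < M$. Write

$$P'_{n,i} - P_{n,i} = P_{n,i}\left(\frac{P'_{n,i}}{P_{n,i}} - 1\right).$$

Sum and take absolute values. We have then

$$|S'_n - S_n| = \left|\sum_{i=1}^{N_n} P'_{n,i} - \sum_{i=1}^{N_n} P_{n,i}\right| \le \sum_{i=1}^{N_n} P_{n,i}\left|\frac{P'_{n,i}}{P_{n,i}} - 1\right|.$$

Consider any $\eta$ as small as desired and write

$$\epsilon = \eta/M.$$

If $n$ is greater than both $n_M$ and $n_\epsilon$ then we shall have both

$$S_n < M \quad\text{and}\quad \left|\frac{P'_{n,i}}{P_{n,i}} - 1\right| < \frac{\eta}{M}.$$

Hence

$$|S'_n - S_n| < \eta.$$

Now $\eta$ may be chosen as small as desired. It follows therefore that
if $S_n$ tends to a finite limit $S$ then $S'_n$ tends to the same limit and
the lemma is proved.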

Proof of theorem. We begin by rewriting the probability

$$P\left\{z_1 \le \frac{k-np}{\sqrt{npq}} \le z_2\right\}$$

as

$$P\{np + z_1\sqrt{npq} \le k \le np + z_2\sqrt{npq}\}.$$

Denote by $k_1$ the smallest integer such that

$$np + z_1\sqrt{npq} \le k_1,$$

let $k_2$ be the largest integer such that

$$k_2 \le np + z_2\sqrt{npq},$$

and substitute in the probability

$$P\{np + z_1\sqrt{npq} \le k \le np + z_2\sqrt{npq}\} = P\{k_1 \le k \le k_2\}.$$

The probabilities that $k = k_1$ and the succeeding terms are
binomial probabilities as given in the statement of the theorem.
Hence

$$P\{(k = k_1) + (k = k_1+1) + \dots + (k = k_2)\} = \sum_{k=k_1}^{k_2} P_{n,k} = \sum_{k=k_1}^{k_2} \frac{n!}{k!\,(n-k)!}\,p^k q^{n-k}.$$

We want to find an approximation to this sum and we therefore
look for an expression $P'_{n,k}$. If an expression $P'_{n,k}$ can be found
such that $P'_{n,k} > 0$, and if for any number $\epsilon > 0$, where $\epsilon$ is as
small as desired, we may find a number $n_\epsilon$, such that for
$n > n_\epsilon$ and $k_1 \le k \le k_2$

$$\left|\frac{P'_{n,k}}{P_{n,k}} - 1\right| < \epsilon,$$

then, by Duhamel's lemma, if there is a finite limit to $\sum_{k=k_1}^{k_2} P'_{n,k}$
there is a limit to $\sum_{k=k_1}^{k_2} P_{n,k}$ and these two limits will be equal.
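Stirling's expansion for $n!$ may be written in the following
form for $n$ large

$$n! = \sqrt{2\pi}\;n^{n+\frac12}\exp\left[-n + \frac{\theta}{12n}\right],$$

where $0 < \theta < 1$. Expanding the factorials in $P_{n,k}$ by means of
this expression, after some arrangement we obtain

$$\frac{n!}{k!\,(n-k)!}\,p^k q^{n-k} = \frac{1}{\sqrt{2\pi npq}}\left(\frac{np}{k}\right)^{k+\frac12}\left(\frac{nq}{n-k}\right)^{n-k+\frac12}\exp\left[\frac{\theta_1}{12n} - \frac{\theta_2}{12k} - \frac{\theta_3}{12(n-k)}\right],$$

where $0 < \theta_1, \theta_2, \theta_3 < 1$. Write $P'_{n,k}$ equal to the terms on the
right-hand side which do not include the exponential and it
follows that

$$\frac{P_{n,k}}{P'_{n,k}} = \exp\left[\frac{\theta_1}{12n} - \frac{\theta_2}{12k} - \frac{\theta_3}{12(n-k)}\right].$$

Since $n$, $k$ and $n-k$ are positive numbers and $0 < \theta_1, \theta_2, \theta_3 < 1$ it
is clear that

$$\frac{1}{12n} > \frac{\theta_1}{12n} - \frac{\theta_2}{12k} - \frac{\theta_3}{12(n-k)} > -\frac{1}{12k} - \frac{1}{12(n-k)},$$

but the right-hand side is dependent on $k$ as well as $n$. Now by
definition $k_1$ is the smallest integer such that

$$np + z_1\sqrt{npq} \le k_1$$

and it will follow therefore that

$$\frac{1}{12k_1} \le \frac{1}{12(np + z_1\sqrt{npq})}.$$

Similarly from the definition of $k_2$ it will follow that

$$\frac{1}{12(n-k_2)} \le \frac{1}{12(nq - z_2\sqrt{npq})}.$$

These inequalities will hold good for any $k$, such that $k_1 \le k \le k_2$,
if we write $k$ instead of $k_1$ and $k_2$. The fundamental inequality
containing the $\theta$'s accordingly becomes

$$\frac{1}{12n} > \frac{\theta_1}{12n} - \frac{\theta_2}{12k} - \frac{\theta_3}{12(n-k)} > -\frac{1}{12(np + z_1\sqrt{npq})} - \frac{1}{12(nq - z_2\sqrt{npq})}.$$

We may now choose $n$ as large as required so that the right- and
left-hand sides of the inequality will differ from zero by as little
as desired. It follows that

$$\exp\left[\frac{\theta_1}{12n} - \frac{\theta_2}{12k} - \frac{\theta_3}{12(n-k)}\right]$$

will differ from unity by as little as desired and that $P'_{n,k}$ will
therefore satisfy the conditions of the lemma. We may proceed,
then, to look for a limit to $\sum_{k=k_1}^{k_2} P'_{n,k}$ knowing that if it exists then
it will be the desired limit of $\sum_{k=k_1}^{k_2} P_{n,k}$. $P'_{n,k}$ was defined as

$$P'_{n,k} = \frac{1}{\sqrt{2\pi npq}}\left(\frac{np}{k}\right)^{k+\frac12}\left(\frac{nq}{n-k}\right)^{n-k+\frac12}.$$

Write $k = np + z\sqrt{npq}$,

where $z_1 \le z \le z_2$.

If we denote by $\Delta z$ the difference between successive values of $z$,
it follows, since $k$ may take only integer values, that

$$k + 1 = np + (z + \Delta z)\sqrt{npq},$$

from which it is obvious that

$$\Delta z = \frac{1}{\sqrt{npq}}$$

and that $\Delta z$ tends to zero as $n$ increases without limit. Rearrange
$P'_{n,k}$ and take logarithms.

$$\log P'_{n,k} = -\tfrac12\log(2\pi npq) - \left(np + z\sqrt{npq} + \tfrac12\right)\log\left(1 + z\sqrt{\frac{q}{np}}\right) - \left(nq - z\sqrt{npq} + \tfrac12\right)\log\left(1 - z\sqrt{\frac{p}{nq}}\right).$$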

The last two terms on the right-hand side may be expanded as 
a series in the form 



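where $0 < |\phi| < 1$, from which, by writing $R_{n,z}$ for the collected
terms of $z^3$, we shall have

$$\log P'_{n,k} = -\tfrac12\log(2\pi npq) - \tfrac12 z^2 + R_{n,z}, \quad\text{that is}\quad P'_{n,k} = \frac{1}{\sqrt{2\pi}}\,e^{-\frac12 z^2}\,e^{R_{n,z}}\,\Delta z.$$

The lemma of Duhamel may now be applied again. Let

$$P''_{n,k} = \frac{1}{\sqrt{2\pi}}\,e^{-\frac12 z^2}\,\Delta z.$$

It will be seen that $P''_{n,k}/P'_{n,k}$ will differ from unity by as little
as desired for some value of $n$ greater than a given number.
From the lemma, therefore, if there is a limit to $\sum_{k=k_1}^{k_2} P''_{n,k}$ it will also
be the limit to $\sum_{k=k_1}^{k_2} P'_{n,k}$ and therefore to $\sum_{k=k_1}^{k_2} P_{n,k}$. Using the
definition that a definite integral is the limit of a sum, we have

$$\lim_{n \to \infty}\sum_{k=k_1}^{k_2} P''_{n,k} = \frac{1}{\sqrt{2\pi}}\int_{z_1}^{z_2} e^{-\frac12 z^2}\,dz$$

and therefore

$$\lim_{n \to \infty}\sum_{k=k_1}^{k_2} P_{n,k} = \frac{1}{\sqrt{2\pi}}\int_{z_1}^{z_2} e^{-\frac12 z^2}\,dz$$

and the theorem is proved.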
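
The manner in which the binomial sum approaches the normal area may be seen numerically. The Python sketch below is an illustration under stated assumptions and plays no part in the proof: the function names are invented, the binomial terms are computed through the logarithm of the $\Gamma$-function so that large $n$ may be used, and the normal area is obtained from the error function of the standard library in place of printed tables.

```python
from math import ceil, erf, exp, floor, lgamma, log, sqrt


def log_binomial_term(n, k, p):
    """log P_{n,k}, using lgamma(m + 1) = log m!."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))


def binomial_interval(n, p, z1, z2):
    """P{z1 <= (k - np)/sqrt(npq) <= z2} by summing the terms from k1 to k2."""
    sd = sqrt(n * p * (1 - p))
    k1 = max(ceil(n * p + z1 * sd), 0)   # smallest integer >= np + z1 sqrt(npq)
    k2 = min(floor(n * p + z2 * sd), n)  # largest integer <= np + z2 sqrt(npq)
    return sum(exp(log_binomial_term(n, k, p)) for k in range(k1, k2 + 1))


def normal_area(z1, z2):
    """Area of the normal curve between z1 and z2."""
    return 0.5 * (erf(z2 / sqrt(2)) - erf(z1 / sqrt(2)))


if __name__ == "__main__":
    target = normal_area(-1.0, 1.0)
    for n in (10, 100, 1000, 10000):
        print(n, round(binomial_interval(n, 0.3, -1.0, 1.0), 4), round(target, 4))
```

The difference between the two columns does not diminish steadily, since $k_1$ and $k_2$ move by whole units as $n$ changes, but it becomes small as $n$ increases without limit.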

The $\beta_1$ and $\beta_2$ of the binomial series were shown to be

$$\beta_1 = \frac{1}{npq} - \frac{4}{n}, \qquad \beta_2 = 3 + \frac{1}{npq} - \frac{6}{n}.$$

Provided therefore that $p$ remains finite, it can be seen that as
$n$ increases without limit $\beta_1$ will tend to 0 and $\beta_2$ to 3, that is to
the $\beta_1$ and $\beta_2$ of the normal curve.

From the definitions given in statistical theory of the moments
of a distribution there would appear to be no reason why the $\beta_1$
and $\beta_2$ of the distribution of a variate which may only take
discrete values should not be calculated. Yet in making such
a calculation the student should ever bear in mind exactly what
it is he is doing. $\beta_1$ and $\beta_2$ are two measures devised by Karl
Pearson to express skewness and flatness of frequency curves.
Now the distribution of a binomial variate can never be a frequency
curve. It consists of a discrete set of points and can never
be the distribution of a continuous variate.

The fact that this is so, however, does not prevent us from
seeking to express the sum of a number of binomial probabilities
in terms of a continuous function. That we may do this was seen
in the last chapter when it was shown that the sum of a number
of binomial probabilities may be expressed exactly in terms of
the incomplete B-function ratio. There is no reason why we
should not express the sum of a number of binomial probabilities
in like manner by means of an area of the normal curve. If it is
desired to find $P\{k \ge k_1\}$, where $k$ is a binomial variate, then,
with the customary notation, we may reduce this variable by its
mean and standard deviation, i.e.

$$\frac{k_1 - np}{\sqrt{npq}},$$

and refer this to the extensive existing tables of areas of the
normal curve. Such a procedure is quite a legitimate one for it
implies the conversion of one function so that entry may be made
in the known tables of another function.

It may be shown simply, using the Euler-Maclaurin theorem,
that a better approximation to the sum of a number of binomial
terms may be made by entering the normal tables with

$$\frac{k_1 - np \mp \tfrac12}{\sqrt{npq}}\,{}^{*} \quad\text{instead of}\quad \frac{k_1 - np}{\sqrt{npq}}.$$

Thus we may give the rules that

$$P\{k \ge k_1\} \approx \frac{1}{\sqrt{2\pi}}\int_{(k_1 - np - \frac12)/\sqrt{npq}}^{+\infty} e^{-\frac12 z^2}\,dz$$

and

$$P\{k \le k_1\} \approx \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{(k_1 - np + \frac12)/\sqrt{npq}} e^{-\frac12 z^2}\,dz.$$

These rules may be memorized easily by imagining that the
binomial probabilities may be represented by a frequency distribution
in which $P_{n,k}$ is constant over the range $(k - \tfrac12)$ to $(k + \tfrac12)$.
Such an assumption is, of course, wholly fallacious, since the
binomial probabilities are a discrete set of points, but it is a
useful aid to memory if not pursued too tenaciously. Table A
below gives illustration of the effect of this corrective term.

* The reader may compare this corrective factor of $\tfrac12$ with the
'continuity' correction for $\chi^2$ in the case of a 2 x 2 table.

The cases were chosen arbitrarily and it will be noted that in
each case the effect of the corrective term is to bring the normal
approximation more closely in line with the exact probability
enumerated from the B-function ratio tables. Even for $n = 10$
the evaluation of the probability sum, as given by the corrected
normal deviate, is not very different from the exact value. This
close correspondence for $n$ comparatively small is of interest
because so far we have not discussed the appropriate number
below which $n$ should not be taken for the normal approximation
to have validity; in Laplace's theorem it will be remembered
that the area of the normal curve only tends to represent the
binomial sum as $n$ increases without limit.

TABLE A. Sum of binomial probabilities estimated by
an area of the normal curve

                                     P{k >= k_1}
                        --------------------------------------------------
                                  Normal, deviate      Normal, deviate
  n     p     q    k_1   Exact    (k_1-np)/sqrt(npq)   (k_1-np-1/2)/sqrt(npq)

 10    0.3   0.7     5   0.1503        0.0838               0.1503
 10    0.5   0.5     5   0.6230        0.5000               0.6241
 10    0.7   0.3     5   0.9526        0.9162               0.9577
 20    0.3   0.7    10   0.0480        0.0256               0.0436
 20    0.5   0.5    10   0.5881        0.5000               0.5882
 20    0.7   0.3    10   0.9829        0.9744               0.9858
 30    0.3   0.7    15   0.0169        0.0084               0.0143
 30    0.5   0.5    15   0.5722        0.5000               0.5721
 30    0.7   0.3    15   0.9936        0.9916               0.9952
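
The comparison in Table A can be repeated numerically. The sketch below is an illustration only: the function names are hypothetical, the exact sum is obtained by direct enumeration rather than from B-function tables, and the normal area is taken from the complementary error function.

```python
from math import comb, erfc, sqrt


def binomial_upper_tail(n, p, k1):
    """Exact P{k >= k1} for a binomial variate, by direct summation."""
    q = 1.0 - p
    return sum(comb(n, k) * p**k * q**(n - k) for k in range(k1, n + 1))


def normal_upper_tail(z):
    """Area of the unit normal curve to the right of the deviate z."""
    return 0.5 * erfc(z / sqrt(2.0))


def normal_estimates(n, p, k1):
    """Return (uncorrected, corrected) normal estimates of P{k >= k1}."""
    sd = sqrt(n * p * (1.0 - p))
    uncorrected = normal_upper_tail((k1 - n * p) / sd)
    corrected = normal_upper_tail((k1 - n * p - 0.5) / sd)
    return uncorrected, corrected


if __name__ == "__main__":
    for n in (10, 20, 30):
        for p in (0.3, 0.5, 0.7):
            k1 = n // 2
            exact = binomial_upper_tail(n, p, k1)
            plain, corrected = normal_estimates(n, p, k1)
            print(f"{n:3d} {p:.1f} {k1:3d} {exact:.4f} {plain:.4f} {corrected:.4f}")
```

The corrected deviate simply moves the point of entry half a unit towards the mean, which is the memory aid described in the text.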



To obtain some idea of the degree of approximation involved 
is difficult in that there are various ways in which it may be desired 
to use the normal areas. It is conceivable that an estimate of 
each of the binomial probabilities in a given series may be 
required. If p is not very different from ½, then for n as little as
5 the areas of the normal curve will agree with each of the bi-
nomial probabilities to within 1%. For n = 10 the correspondence
is good. However, the main use to which we may expect to put 
the normal approximation is for tests of significance. 

The statistician is concerned with the setting up of arbitrary 
probability levels whereby hypotheses may be tested. Because 
these levels are arbitrarily chosen, the acceptance or rejection of 
a hypothesis cannot be insisted on too rigorously if the calculated 
probability falls near the significance level. For example, if the 
5% level of significance is decided on a priori, and the calculated
probability is found to be 0.057, the statistician would feel no
more certain about the acceptance of the hypothesis under test
than he would about its rejection for a calculated probability of
0.043. In the absence of further evidence he would consider that
the issue was in doubt. It follows therefore that for approximate
tests of significance which may be carried out first of all on raw
material in order to make preliminary judgments, the reduced
binomial variate

$$\frac{k - np - \tfrac{1}{2}}{\sqrt{npq}},$$

which it is recognized may not be exact, may be used in con- 
junction with the normal probability scales. This is advantageous 
in that the normal probability levels are easily remembered and 
a quick rough test of significance may therefore be made. 

If the rejection level is 10%, then n may be as small as 5 and p
may vary from 0.1 to 0.9 if a variation of 4% is allowed in the
level. For instance, the error involved in assuming normality
for n = 5 and p = 0.1 is 2% at the 10% level. A smaller rejection
level will not have much meaning for n = 5. For n = 10 and
upwards with p varying from 0.3 to 0.7 the error involved is of
the order of 1% at the 5% level. The lower the significance level
chosen the more likely is the normal approximation to be in error
and at the 0.005 level only the value p = q = ½ will be found to
lie within 0.004 to 0.006 for n as large as 50. However, provided
it is remembered that the test is approximate then little error
will result from its use. As a general rule it may be remembered
that for p < 0.5, whatever n, the normal test will tend to over-
emphasize the significance at the upper significance level, while
for p > 0.5 it will underestimate it.

Example. The proportion of male births within the whole
population is of the order of 0.51. A family is observed composed
of 10 males and no females. Assuming that order of birth does
not affect the probability of being born male, is it considered
that a family so constituted is exceptional? The probabilities of
obtaining 0, 1, 2, ..., 10 males in 10 offspring will be given by the
generating function $(0.49 + 0.51)^{10}$.

Hence the probability that out of 10 children all will be males
will be given approximately by

$$\frac{10 - 10(0.51) - 0.5}{(10 \times 0.51 \times 0.49)^{1/2}}$$

referred to tables of the normal probability integral. The
numerical figure for the probability is 0.0027. That is to say it
might be expected that a family so constituted would occur less 
than 3 times in 1000 families composed of 10 offspring. 

Example. From previous experience it has been found that 
when a river overflows on a road an average of six in every ten 
cars manage to get through the flood. Fifty cars attempt the 
crossing. What is the probability that thirty-five or more will 
get through? 

Here the generating function is $(0.4 + 0.6)^{50}$ and

$$P\{k \ge 35\}$$

is desired. Using the normal approximation the deviate is

$$\frac{35 - 30 - 0.5}{(50 \times 0.6 \times 0.4)^{1/2}} = 1.30,$$

giving a normal probability of 0.10. That is to say, the odds
are approximately 9 to 1 against 35 or more cars getting
through.

Example. At a given distance, x feet, from the explosion of 
a bomb, it is known that the probability of a pane of glass being 
smashed is 0.3. What is the probability that out of 100 panes of
glass situated at x feet from the explosion 40 or more will be
smashed?

The normal deviate is

$$\frac{40 - 30 - 0.5}{(100 \times 0.3 \times 0.7)^{1/2}}$$

and the equivalent normal probability is 0.02. We should say
therefore that it is doubtful whether so many panes of glass will
be smashed at a distance x feet.

Example. If the chance of a house being hit by an incendiary 
bomb is 0.1 and if 25 bombs fall randomly within the area in
which the house is situated, what is the probability that the
house receives more than one bomb?

The normal deviate is

$$\frac{2 - 2.5 - 0.5}{(25 \times 0.1 \times 0.9)^{1/2}}$$

and the chance that the house receives more than one bomb is
therefore 0.75. The exact probability calculated from the
B-function tables is 0.73.

On three-quarters of the occasions on which 25 bombs fall in
an equivalent area, the house will receive two or more bombs.
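
The four worked examples above all use the same corrected deviate, so they may be checked together. The following is a minimal sketch with hypothetical names; it assumes each question asks for an upper tail $P\{k \ge k_1\}$, and it adds a direct enumeration of the exact sum for comparison.

```python
from math import comb, erfc, sqrt


def corrected_deviate(k1, n, p):
    """Deviate (k1 - np - 1/2) / sqrt(npq) used for P{k >= k1}."""
    return (k1 - n * p - 0.5) / sqrt(n * p * (1.0 - p))


def upper_tail(z):
    """Area of the unit normal curve to the right of z."""
    return 0.5 * erfc(z / sqrt(2.0))


def exact_upper_tail(k1, n, p):
    """Exact binomial P{k >= k1} by summing the individual terms."""
    return sum(comb(n, k) * p**k * (1.0 - p)**(n - k) for k in range(k1, n + 1))


# (label, k1, n, p) for the examples in the text
EXAMPLES = [
    ("ten male births", 10, 10, 0.51),
    ("35 or more cars", 35, 50, 0.6),
    ("40 or more panes", 40, 100, 0.3),
    ("two or more bombs", 2, 25, 0.1),
]

if __name__ == "__main__":
    for label, k1, n, p in EXAMPLES:
        z = corrected_deviate(k1, n, p)
        print(f"{label:18s} z = {z:6.3f}  normal = {upper_tail(z):.4f}"
              f"  exact = {exact_upper_tail(k1, n, p):.4f}")
```

Printing the exact value beside the normal area shows how far the rough test departs from the true probability in each case.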

REFERENCES AND READING 

A detailed development of the replacement of the sum of a number of 
binomial probabilities by the normal probability integral will be found 
in recent papers by S. Bernstein, where the error term is included. 

A simplified version of Bernstein's theorem has been put forward by 
J. V. Uspensky, Introduction to Mathematical Probability. 

A simplified version of the proof given in this chapter will be found in 
J. L. Coolidge, Introduction to Mathematical Probability. 

Most statistical text-books use the normal approximation, with or 
without the corrective term, with a certain amount of indiscrimination. 
The student will find many examples to work through in such text-books. 



CHAPTER VI 

POISSON'S LIMIT FOR BINOMIAL 
PROBABILITIES 

When p is very small but finite, it is necessary for n to be 
very large indeed before the normal integral will approximate 
to a sum of binomial probabilities. The incomplete B-function
ratio tables do not extend below p = 0.01 and it is necessary
therefore to find another method of approximating to the 
required sum. This we do in Poisson's limit for binomial 
probabilities. 

During a war the ideas of the individual regarding the funda- 
mental sets on which probabilities are calculated are liable to 
vary according to whether he is exposed to immediate risk or not. 
For example, to the individual under fire the subjective prob- 
ability set will be composed of two alternatives, survival or 
non-survival. To the statistician, however, calculating a prob- 
ability on the total number of persons exposed to risk, the chance 
of any one individual being hurt was found to be very small. 
In fact, many chances for what might be expressed as war risks 
were found to be small and examples for which Poisson's limit 
to the binomial was valid were numerous. 

Poisson's limit to the binomial, like the normal approximation, 
has been considerably misused in statistical practice. It is some- 
times referred to as Poisson's law of Small Numbers; a bad 
nomenclature in that the limit loses its binomial parentage and 
may lead to misunderstanding of the fact that the 'law' is nothing
more than a limit, under certain conditions, for binomial 
probabilities. 



POISSON'S LIMIT FOR BINOMIAL 
PROBABILITIES 

THEOREM. If n is the number of absolutely independent trials, 
and if in each trial the probability of an event E is constant and 
equal to p, where p is small but finite, and if k denote the number
of trials in which an event E occurs, then provided m and k
remain finite

$$P_{n,k} \to \frac{m^k}{k!}\,e^{-m}$$

as n increases without limit, where $P_{n,k}$ has its usual meaning
and m = np.
It is convenient to begin with a simple inequality.* If 



where n, is a positive integer. This inequality is self-evident. 
From this inequality it is clear that if 






We shall find it necessary to use the inequality in this form. 

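$P_{n,k}$ is defined as

$$P_{n,k} = \frac{n!}{k!\,(n-k)!}\,p^k q^{n-k}.$$

Writing m = np and rearranging

$$P_{n,k} = \frac{m^k}{k!}\left(1 - \frac{m}{n}\right)^{n-k}\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right)\cdots\left(1 - \frac{k-1}{n}\right).$$

Applying the fundamental inequality we may obtain an in-
equality for $P_{n,k}$:

$$\left(1 - \frac{m}{n}\right)^{n-k}\frac{m^k}{k!}\left(1 - \frac{k}{n}\right)^{k-1} \le P_{n,k} \le \left(1 - \frac{m}{n}\right)^{n-k}\frac{m^k}{k!}\left(1 - \frac{k}{2n}\right)^{k-1}.$$

* See J. V. Uspensky, Introduction to Mathematical Probability,
pp. 135-7. The proof given here follows his outline.

Since k, m and n are positive it may be shown, by means of a
transformation of the type

$$\left(1 - \frac{m}{n}\right)^{n-k} = \exp\left[(n-k)\log\left(1 - \frac{m}{n}\right)\right], \qquad \log(1 - x) = -x - \frac{x^2}{2} - \cdots,$$

that

$$\left(1 - \frac{m}{n}\right)^{n-k} < \exp\left[-m + \frac{k}{n}\,m - \frac{n-k}{2n^2}\,m^2\right]$$

and

$$\left(1 - \frac{k}{2n}\right)^{k-1} \le \exp\left[-\frac{k(k-1)}{2n}\right].$$

The right-hand side of the inequality for $P_{n,k}$ may therefore be
rewritten as follows:

$$P_{n,k} < \frac{m^k}{k!}\,e^{-m}\exp\left[\frac{k}{n}\,m - \frac{n-k}{2n^2}\,m^2 - \frac{k(k-1)}{2n}\right].$$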

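Consider now the left-hand side of the inequality for $P_{n,k}$,
namely $\frac{m^k}{k!}\left(1 - \frac{m}{n}\right)^{n-k}\left(1 - \frac{k}{n}\right)^{k-1}$, and
use the same device as before.

$$\left(1 - \frac{m}{n}\right)^{n-k} = \left(1 + \frac{m}{n-m}\right)^{-(n-k)} > \exp\left[-m + \frac{m(k-m)}{n-m} + \frac{n-k}{2}\,\frac{m^2}{(n-m)^2} - \frac{n-k}{3}\,\frac{m^3}{(n-m)^3}\right],$$

$$\left(1 - \frac{k}{n}\right)^{k-1} = \left(1 + \frac{k}{n-k}\right)^{-(k-1)} \ge \exp\left[-\frac{k(k-1)}{n-k}\right],$$

from which it follows that

$$P_{n,k} > \frac{m^k}{k!}\,e^{-m}\exp\left[\frac{m(k-m)}{n-m} + \frac{n-k}{2}\,\frac{m^2}{(n-m)^2} - \frac{n-k}{3}\,\frac{m^3}{(n-m)^3} - \frac{k(k-1)}{n-k}\right].$$

If therefore we write

$$\epsilon = \exp\left[\frac{k}{n}\,m - \frac{n-k}{2n^2}\,m^2 - \frac{k(k-1)}{2n}\right]$$

and

$$\eta = \exp\left[\frac{m(k-m)}{n-m} + \frac{n-k}{2}\,\frac{m^2}{(n-m)^2} - \frac{n-k}{3}\,\frac{m^3}{(n-m)^3} - \frac{k(k-1)}{n-k}\right],$$

the inequality for $P_{n,k}$ becomes

$$\eta < \frac{P_{n,k}}{\dfrac{m^k}{k!}\,e^{-m}} < \epsilon.$$

Now provided k and m remain finite as n increases, $\epsilon$ and $\eta$ will
differ from unity by as little as desired, and therefore both sides
of the inequality will differ from unity by as little as desired.

It follows that

$$P_{n,k} \to \frac{m^k}{k!}\,e^{-m} \quad\text{as}\quad n \to \infty,$$

provided both m and k are finite.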
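
The limiting statement may be illustrated numerically by holding m and k fixed and letting n increase. This is only a sketch under stated assumptions: the function names and the values m = 2, k = 3 are chosen for illustration and do not come from the text.

```python
from math import comb, exp, factorial


def binomial_term(n, k, m):
    """P_{n,k} with p = m / n, so that the mean np stays fixed at m."""
    p = m / n
    return comb(n, k) * p**k * (1.0 - p)**(n - k)


def poisson_term(k, m):
    """Poisson limit m^k e^{-m} / k!."""
    return m**k * exp(-m) / factorial(k)


def gaps(k=3, m=2.0, sizes=(10, 100, 1000, 10000)):
    """Absolute difference between P_{n,k} and its limit for each n."""
    limit = poisson_term(k, m)
    return [abs(binomial_term(n, k, m) - limit) for n in sizes]


if __name__ == "__main__":
    for n, gap in zip((10, 100, 1000, 10000), gaps()):
        print(f"n = {n:6d}   |P_nk - limit| = {gap:.6f}")
```

For these values the gap shrinks roughly in proportion to 1/n, which is the behaviour the bounding exponentials suggest.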

Under the assumptions through which the Poisson limit is 
reached it would be possible intuitively to write down its 
moments. They are, however, quickly reached by the elementary 
method previously used for binomial probabilities. If 

$$P_{m,k} = \frac{m^k}{k!}\,e^{-m} = \lim P_{n,k} \quad (k \text{ and } m \text{ finite}),$$

then

$$\mu'_r = \sum_{k=0}^{\infty} \frac{m^k}{k!}\,e^{-m}\,k^r,$$

from which it is seen that

$$\mu'_1 = m, \quad \mu_2 = m, \quad \mu_3 = m, \quad \mu_4 = 3m^2 + m$$

and

$$\beta_1 = \frac{1}{m}, \quad \beta_2 = 3 + \frac{1}{m}.$$

One point should be noticed here. The summation for moments
was taken over the range k = 0 to k = +∞ and not, as is strictly
correct, over the range k = 0 to k = n. It is necessary to do this
because

$$\sum_{k=0}^{\infty} \frac{m^k}{k!}\,e^{-m} = 1, \quad\text{while}\quad \sum_{k=0}^{n} \frac{m^k}{k!}\,e^{-m} \ne 1.$$

The approximation involved by this alteration of the limit of 
the summation sign is, however, of negligible proportions as 
may easily be shown by an actual calculation of the terms 
involved. 
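
The moments quoted above, and the smallness of the terms neglected by extending the sum to infinity, can both be checked by direct summation. The sketch below uses hypothetical function names and truncates the infinite series at 200 terms, which is an assumption that is harmless for moderate m.

```python
from math import exp


def poisson_probs(m, terms=200):
    """First `terms` Poisson probabilities, built by the ratio m / k."""
    probs = [exp(-m)]
    for k in range(1, terms):
        probs.append(probs[-1] * m / k)
    return probs


def poisson_moments(m, terms=200):
    """Return (mean, mu2, mu3, mu4, beta1, beta2) by numerical summation."""
    probs = poisson_probs(m, terms)
    mean = sum(k * pk for k, pk in enumerate(probs))

    def central(r):
        return sum((k - mean)**r * pk for k, pk in enumerate(probs))

    mu2, mu3, mu4 = central(2), central(3), central(4)
    return mean, mu2, mu3, mu4, mu3**2 / mu2**3, mu4 / mu2**2


def truncated_total(m, n):
    """Sum of the Poisson terms from k = 0 to k = n only."""
    return sum(poisson_probs(m, n + 1))


if __name__ == "__main__":
    print(poisson_moments(1.5))
    print(1.0 - truncated_total(1.0, 10))  # mass lost by stopping at k = n
```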

The Poisson limit has several advantages over the true
binomial probabilities provided the conditions laid down for its
use are justified. The incomplete B-function ratio tables are
actually tabulated for arguments of 0.01 of the constant prob-
ability p, but where p is less than 0.01 interpolation into the
tables becomes difficult indeed. The Poisson limit is extensively
tabled and the extraction of probabilities for p small is thus a
simple procedure. In mathematical form it is more tractable to
handle than the true binomial probability, while for arithmetical
purposes the fact that the mean and variance are equal means
that only one set of calculations need be carried out.

Example. Compare the true binomial probabilities and those
obtained by assuming Poisson's limit for n = 10 and p = 0.1.

   k      Binomial                            Poisson
          I_p(k, n-k+1) - I_p(k+1, n-k)       m^k e^{-m}/k!

   0      0.34868                             0.36788
   1      0.38742                             0.36788
   2      0.19371                             0.18394
   3      0.05740                             0.06131
   4      0.01116                             0.01533
   5      0.00149                             0.00307
   6      0.00014                             0.00051
   7      0.00001                             0.00007
   8      0.00000                             0.00001
   9      0.00000                             0.00000
  10      0.00000                             0.00000

 Total    1.00000                             1.00000




At first sight the agreement between the two series does not 
seem too good. This is partly because of the large number of 
decimal places taken; for considering that n is only equal to 10 
the agreement when the first two decimal places only are taken 
is as close as could be expected. For p smaller than 0.1 the agree-
ment between the exact calculated values and the Poisson limit
should be closer, as also for the same p of 0.1 and larger n. It
is well to note, however, that such divergences do occur. 
Poisson's limit should not be applied blindly and without due 
regard for the conditions of the theorem. 
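
The two series of the preceding example may be generated directly rather than taken from tables. This is a small sketch with a hypothetical function name; the binomial terms are computed from their definition, not from differences of the B-function ratio.

```python
from math import comb, exp, factorial


def compare_series(n=10, p=0.1):
    """Return rows (k, binomial term, Poisson term) with m = n * p."""
    m = n * p
    rows = []
    for k in range(n + 1):
        binomial = comb(n, k) * p**k * (1.0 - p)**(n - k)
        poisson = m**k * exp(-m) / factorial(k)
        rows.append((k, binomial, poisson))
    return rows


if __name__ == "__main__":
    for k, binomial, poisson in compare_series():
        print(f"{k:2d}  {binomial:.5f}  {poisson:.5f}")
```

Calling `compare_series(100, 0.01)` keeps m = 1 but brings the two columns much closer, in line with the remark that the agreement improves for smaller p and larger n.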

Example. A caterpillar 2x inches long starts to cross at right
angles a one-way cycle-track T yards wide at a speed of f feet per
second. If cycles are passing this particular spot at random
intervals but at an average rate of n per second, what is the
probability that the caterpillar reaches the other side safely? It
may be assumed that the impress of a cycle tyre on the ground
is equal to t inches and that the caterpillar may only be touched
to be considered hurt.

If the caterpillar is 2x inches long and the impress of the tyre
on the road is t inches then the best estimate we can make of the
probability that a caterpillar will be run over by a single cycle is

$$p = \frac{2x + t}{36T}.$$

If the speed of the caterpillar is f feet per second and the track
is T yards wide the caterpillar will take 3T/f seconds to cross the
track. Moreover, if cycles are passing with an average frequency
of n per second, during these 3T/f seconds an average of 3Tn/f
cycles will pass the given spot. We require the chance that none
of these cycles do more than graze the caterpillar, that is we
require

$$(1 - p)^{3Tn/f}.$$

Generally the probability of one cycle running over the cater-
pillar will be very small so that the Poisson limit will give us the
approximate evaluation of this chance as

$$\exp\left[-\frac{3Tn}{f}\,p\right] = \exp\left[-\frac{n(2x + t)}{12f}\right].$$

It will be remembered that this answer is correct only if the
original estimate concerning the single constant probability is
correct. We have no means of judging from the question whether
this is so.
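
To see how close the Poisson evaluation lies to the binomial form, the two may be computed side by side. The sketch below assumes the single-cycle probability is (2x + t)/(36T), and the numerical values passed in are purely hypothetical.

```python
from math import exp


def safe_crossing(x, t, T, f, n):
    """Return (binomial form, Poisson form) of the chance of a safe crossing.

    x: half-length of caterpillar (inches), t: tyre impress (inches),
    T: track width (yards), f: caterpillar speed (feet per second),
    n: average number of cycles passing per second.
    """
    p = (2.0 * x + t) / (36.0 * T)      # assumed chance that one cycle touches
    cycles = 3.0 * T * n / f            # average number of cycles while crossing
    binomial_form = (1.0 - p) ** cycles
    poisson_form = exp(-cycles * p)
    return binomial_form, poisson_form


if __name__ == "__main__":
    # hypothetical figures: 1 inch caterpillar, 1 inch tyre, 2 yard track
    print(safe_crossing(x=0.5, t=1.0, T=2.0, f=0.05, n=0.2))
```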

Example. Over a generation of students have been amused
by Bortkewitsch's data, now classical, of deaths from kicks by
a horse in 10 Prussian Army Corps during 20 years. The material
he gave was as follows:

 Actual deaths    Frequency    Frequency
 per corps        observed     Poisson's limit

      0             109           109
      1              65            66
      2              22            20
      3               3             4
      4               1             1
      5               0             0

    Total           200           200

The mean of the observed frequency is 0.61 from which we have
m = np = 0.61; p = 0.003.

The chance of being killed by a kick from a horse in any one year
in any one Army corps is therefore extremely small. The frequency
using Poisson's limit may be found directly from tables (Molina)
entering with m. It may well have been that at the time at which
Bortkewitsch wrote conditions were sufficiently stable to allow
him to consider that p could be constant over a period of 20 years.
This state of affairs is, however, hardly likely to obtain to-day
and it should be remembered in the grouping of such data that
the fundamental assumption is that there is a small but constant
probability that the event will happen in any one trial.
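
The Poisson frequencies in the horse-kick table can be recomputed from the observed mean instead of being read from tables. The following is a minimal sketch; the variable and function names are hypothetical.

```python
from math import exp, factorial

# Observed number of corps-years with 0, 1, ..., 5 deaths
OBSERVED = [109, 65, 22, 3, 1, 0]


def poisson_fit(observed):
    """Return (mean, expected frequencies) for a Poisson series with that mean."""
    total = sum(observed)
    mean = sum(k * f for k, f in enumerate(observed)) / total
    expected = [total * mean**k * exp(-mean) / factorial(k)
                for k in range(len(observed))]
    return mean, expected


if __name__ == "__main__":
    mean, expected = poisson_fit(OBSERVED)
    print(f"m = {mean:.2f}")
    for k, (obs, fit) in enumerate(zip(OBSERVED, expected)):
        print(f"{k}  {obs:4d}  {fit:7.2f}")
```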

Exercise. The emission of α-particles from a radioactive
substance was measured by Rutherford and Geiger. If t is the
number of particles observed in units of time of 1/8 minute, and
if n_t is the number of intervals in which t particles were observed
then the experimental results may be expressed by the following
table:

    t      n_t

    0       57
    1      203
    2      383
    3      525
    4      532
    5      408
    6      273
    7      139
    8       49
    9       27
   10       10
   11        4
   12        0
   13        1
   14        1

  Total   2612


Calculate the probability that a single particle will be observed 
in a single time unit and fit Poisson's limit to the observed 
frequencies if it is considered justifiable. 

Example. A man is standing some distance from blasting
operations in a quarry. Five small pieces of debris fall randomly
within a space of 10 sq.yd. If the plan area of a man is 2 sq.ft.,
what is the probability that if he were standing within the
10 sq.yd. he would not have been hit by any of the pieces of
debris?

It is stated that the debris fall randomly within 10 sq.yd. This
implies that of the 90 sq.ft. any one part is as likely to be hit as
any other, and that we may say therefore that the probability of
any 2 sq.ft. receiving a single piece of debris is

$$p = \frac{2}{90} = \frac{1}{45}.$$

Five pieces of debris are received in 10 sq.yd. Hence the prob-
ability that none of these is received by any particular 2 sq.ft. is

$$\left(1 - \frac{1}{45}\right)^5 \approx 0.89.$$

Conversely the probability that he will be hit by at least one
piece is approximately 0.11.


We have noted previously that the fundamental assumption of
the binomial theorem on probabilities and therefore of Poisson's
limit is that the probability is constant from trial to trial. This
does not mean that if an observed series of frequencies is found
to be graduated closely by a Poisson series the underlying
probability must necessarily be constant, although given full
information it might be a reasonable hypothesis to make. If
the probability of a single event varies considerably from trial
to trial so that the material collected is not homogeneous,
Poisson's limit may be a good fit to the collected data, but it is
possible that a series known as the negative binomial may be
more appropriate. It is easy to see how such a case might arise.
We have seen that for the binomial

$$np = \mu'_1, \quad npq = \mu_2,$$

whence $q = \mu_2/\mu'_1$. If $\mu_2$ is greater than $\mu'_1$ then q is greater than
unity and p = 1 − q is negative. Since $\mu'_1$ is positive this implies
that n must be negative also and that a series of the form

$$(q + p)^{n}, \quad p < 0, \; n < 0,$$

would be the appropriate one to fit. In certain cases a negative
binomial may be expected from hypothesis, as in the theory given
by Yule for the proportion of a population dying after the nth
exposure to a disease. Unless, however, some such theoretical
background can be constructed for a problem the advantage of
fitting such a series may be questioned. The reader must remember
that p is the probability that an event will occur in a single trial
and is a priori postulated as lying between 0 and 1. n is the
number of trials and must be a positive integer. It would appear
wrong therefore to carry out calculations in which p and n are 
given negative values. Further, the object of fitting a series to 
obtain any graduations of observed frequency is to enable 
conclusions to be drawn from such a fitting, and it is difficult to 
see what conclusions could be drawn from the fitting of a negative 
binomial unless it is expected on theoretical grounds. 

It was first pointed out by 'Student' that series in which the
variance is greater than the mean arise from the probability p
not remaining constant from trial to trial. These situations are
not uncommon in bacteriological work. 'Student' found that the
distribution of cells in a haemacytometer did not follow a Poisson
series although it might reasonably be expected to do so. The
hypothesis put forward to 'explain' this was that the presence
of one cell in a square of the haemacytometer altered the prob-
ability that there would be another, owing to the first exerting
some attraction on the second, and so on. Karl Pearson writes,
'if two or more Poisson series be combined term by term from
the first, then the compound will always* be a negative binomial'
and remarks that this theorem was suggested to him by 'Student'.
It is, however, not altogether certain that the converse of the
theorem will hold good.

Pearson actually considered only the case of graduating the 
negative binomial by two separate Poisson series, which may 
account for the fact that this method is not always satisfactory 
in practice. His object was to obtain some means whereby a 
satisfactory interpretation could be given to the observational 
data. Actually, of course, unless there is an a priori reason to 
expect such a dichotomy, the splitting of the data into two 
Poisson series may not help very much. 



PEARSON'S THEOREM FOR THE GRADUATION
OF THE NEGATIVE BINOMIAL

A series of N observations, for each of which the fundamental
probability p may vary, may be described by the two Poisson
series

$$\nu_1\,\frac{m_1^k}{k!}\,e^{-m_1} + \nu_2\,\frac{m_2^k}{k!}\,e^{-m_2} \quad (k = 0, 1, 2, \ldots),$$

where $m_1$ and $m_2$ are the roots of the equation

$$m^2(a_2 - a_1^2) - m(a_3 - a_1 a_2) + a_3 a_1 - a_2^2 = 0,$$

and $a_1, a_2, a_3$ are obtained from the moments of the series of
N observations by means of the relationships

$$a_1 = \mu'_1, \quad a_2 = \mu'_2 - \mu'_1, \quad a_3 = \mu'_3 - 3\mu'_2 + 2\mu'_1$$

and

$$\frac{\nu_1}{N} = \frac{\mu'_1 - m_2}{m_1 - m_2}, \quad \frac{\nu_2}{N} = \frac{\mu'_1 - m_1}{m_2 - m_1}.$$

The proof of the theorem is straightforward and may be left to
the student as an exercise.

* This is not strictly true.

Example of use of the theorem. 'Student' gave the count of
yeast cells in 400 squares of a haemacytometer in the following
table:

 Number of yeast cells     0     1     2     3     4     5    Total
 Frequency               213   128    37    18     3     1     400

Here $\mu'_1 = 0.6825$, $\mu_2 = 0.8117$, $\mu_3 = 1.0876$,

giving q = 1.19, p = −0.19, n = −3.59,

so that the negative binomial would be

$$400(1.19 - 0.19)^{-3.59}.$$

Solving the equations given in the theorem, we obtain

$$\nu_1 = 237, \quad m_1 = 0.385,$$

$$\nu_2 = 163, \quad m_2 = 1.116,$$

whence by calculating out these two Poisson series a good fit is
obtained.



 No. of yeast cells        0        1       2       3      4      5      6      7

 1st Poisson series     161.44    62.11   11.95    1.53   0.15   0.01   0.00   0.00
 2nd Poisson series      53.32    59.52   32.22   12.36   3.45   0.77   0.14   0.02

 Total                  215      122      44      14      4      1 (5 and over)

 Observed frequency     213      128      37      18      3      1 (5 and over)



Many other negative binomials may be graduated in this way by 
the addition of two Poisson series. However, as we have noted, 
the dichotomy is not always satisfactory and possibly this is 
because more than two Poisson series may be required adequately 
to describe the observations. 
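
The graduation of the yeast-cell counts by two Poisson series may be reproduced from the raw frequencies. The sketch below is one possible arrangement of the arithmetic, with hypothetical names; it assumes the quantities $a_1, a_2, a_3$ are the first three factorial moments formed from the moments about the origin, and that the two parameters are the roots of the quadratic stated in the theorem.

```python
from math import exp, factorial, sqrt

# 'Student's' yeast-cell counts for 0, 1, ..., 5 cells per square
COUNTS = [213, 128, 37, 18, 3, 1]


def two_poisson_fit(counts):
    """Return (nu_a, m_a, nu_b, m_b) with m_a the smaller root."""
    total = sum(counts)

    def raw(r):
        # r-th moment about the origin
        return sum(k**r * f for k, f in enumerate(counts)) / total

    a1 = raw(1)
    a2 = raw(2) - raw(1)
    a3 = raw(3) - 3.0 * raw(2) + 2.0 * raw(1)
    # Quadratic  A m^2 - B m + C = 0  from the theorem
    A = a2 - a1**2
    B = a3 - a1 * a2
    C = a3 * a1 - a2**2
    root = sqrt(B * B - 4.0 * A * C)
    m_a = (B - root) / (2.0 * A)
    m_b = (B + root) / (2.0 * A)
    nu_a = total * (a1 - m_b) / (m_a - m_b)
    nu_b = total * (a1 - m_a) / (m_b - m_a)
    return nu_a, m_a, nu_b, m_b


def fitted_frequencies(counts, kmax=7):
    """Sum of the two Poisson series for k = 0, ..., kmax."""
    nu_a, m_a, nu_b, m_b = two_poisson_fit(counts)
    return [nu_a * m_a**k * exp(-m_a) / factorial(k)
            + nu_b * m_b**k * exp(-m_b) / factorial(k)
            for k in range(kmax + 1)]


if __name__ == "__main__":
    print(two_poisson_fit(COUNTS))
    print([round(v, 2) for v in fitted_frequencies(COUNTS)])
```

If the expression under the square root were negative for some set of data, no real pair of Poisson series would exist, which is one way the dichotomy can fail in practice.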

Neyman has discussed what he terms a new class of 'con-
tagious distributions' which, it seems, will be applicable to many
types of heterogeneous data and which will moreover give at 
least as good a fit and probably a better fit than many of the 
existing series. The moments of the distribution may be 
derived by the reader at a later stage since they will follow most 
naturally from the application of the theory of characteristic 
functions. However, the practical use of the theorem is pertinent 
at this point and we shall therefore state it without proof.

NEYMAN'S THEOREM DESCRIBING DATA
IN WHICH $\mu_2 > \mu'_1$

A series of N observations, for each of which the fundamental
probability, p, may vary, may be described by the series

$$P\{X = k + 1\} = \frac{m_1 m_2 e^{-m_2}}{k + 1} \sum_{t=0}^{k} \frac{m_2^t}{t!}\,P\{X = k - t\},$$

where $P\{X = 0\} = \exp\{-m_1(1 - e^{-m_2})\}$,

k being the number of "successes", 0, 1, 2, ...,

$$m_1 = \frac{\mu'_1}{m_2}$$

and

$$m_2 = \frac{\mu_2 - \mu'_1}{\mu'_1}.$$

$\mu_2$ and $\mu'_1$ are calculated from the observations; $m_1$ and $m_2$ are
essentially positive.

There appears to be no reason why this series should not fit
adequately all binomial type series for which $\mu_2$ is greater than
$\mu'_1$. Since the positive binomial will probably be sufficient when
$\mu_2$ is less than $\mu'_1$, and the simple Poisson for $\mu_2$ approximately
equal to $\mu'_1$, it will be seen that Neyman's series extends the range
of theoretical distributions necessary for describing frequency
distributions for which $\mu_2 > \mu'_1$.

We must remark, however, as for the negative binomial, that 
the fitting of the series will only be of practical use provided 
the estimated parameters are capable of physical interpretation. 

Example. Greenwood and Yule give a table of frequency of
accidents in 5 weeks to 647 women working on H.E. shells.
A simple Poisson distribution fitted to these figures does not
graduate the observed frequencies very well, the reason being, it
is supposed, that accident proneness is different for different
women. Greenwood and Yule fit a type of negative binomial to
the material and obtain a good fit. The writer has fitted Neyman's
series to the same material and it will be seen that this series gives
a slightly better fit than the negative binomial. It is, however,
difficult to see what the parameters of either distribution mean.

 Number of    Observed     Poisson         Negative binomial    Neyman's
 accidents    frequency    distribution    distribution         series

     0           447          406               442                448
     1           132          189               140                128
     2            42           45                45                 49
     3            21            7                14                 16
     4             3            1                 5                  5
     5             2            0.1               2                  1

   Total         647          648               648                647

The only drawback to the use of Neyman's series would appear
to be the relatively heavy computation which is involved if the k
of the series is over 10 (say). However, this is a slight failing,
because it is rare that series of this type are found with k very
large and in any case it should not be difficult to devise a suitable
computational scheme.
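
A computational scheme of the kind mentioned is short to write down. The sketch below is an illustration under stated assumptions: it takes Neyman's series in the usual Type A recurrence form, estimates $m_1$ and $m_2$ from the mean and variance (with divisor N), and uses hypothetical names throughout.

```python
from math import exp, factorial

# Greenwood and Yule: women with 0, 1, ..., 5 accidents
ACCIDENTS = [447, 132, 42, 21, 3, 2]


def neyman_type_a(counts, kmax=30):
    """Return (m1, m2, probabilities P{X = 0..kmax}) fitted by moments."""
    total = sum(counts)
    mean = sum(k * f for k, f in enumerate(counts)) / total
    var = sum(k * k * f for k, f in enumerate(counts)) / total - mean**2
    m2 = (var - mean) / mean          # requires variance greater than mean
    m1 = mean / m2
    probs = [exp(-m1 * (1.0 - exp(-m2)))]
    for k in range(kmax):
        # recurrence: each term is built from all the earlier ones
        inner = sum(m2**t / factorial(t) * probs[k - t] for t in range(k + 1))
        probs.append(m1 * m2 * exp(-m2) / (k + 1) * inner)
    return m1, m2, probs


if __name__ == "__main__":
    m1, m2, probs = neyman_type_a(ACCIDENTS)
    total = sum(ACCIDENTS)
    print(f"m1 = {m1:.3f}, m2 = {m2:.3f}")
    print([round(total * p, 1) for p in probs[:6]])
```

The work grows only as the square of the largest k required, so even a series running to k = 10 or beyond is a matter of a few hundred multiplications.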

Exercise. The number of defective teeth in alien Jewish 
children (boys) aged 12 years is given in the table below: 



 t = No. of         n_t = No. of boys with
 teeth affected     t teeth affected

       0                 73
       1                 56
       2                 37
       3                 52
       4                 31
       5                 18
       6                 22
       7                  6
       8                  9
       9                  3
      10                  2
      11                  2
      12                  2

    Total               313

Fit (a) Poisson's series, (b) the negative binomial series, (c) Ney-
man's contagious series to this material. Can you suggest any
reason why the variance is larger than might have been expected?

CHAPTER VII

PROBABILITIES A POSTERIORI 
CONFIDENCE LIMITS 

In the first chapter we defined the population probability of 
a characteristic as being the proportion of units possessing that 
characteristic within the population. It is clear therefore that 
if the population probability is known then it is possible to 
specify all the various compositions which a sample drawn from 
that population may have and the probability that each of these 
compositions will arise in the random drawing of a single sample. 
For example, when considering a pack of 52 cards, if the prob- 
ability of drawing any one card is 1/52 and five cards are drawn 
randomly from the pack, all variations of the 5 cards can be 
enumerated and the probability of drawing any one particular 
set may be calculated. Thus from a knowledge of the population 
we are able to specify the probability of the sample and the most 
probable composition of the sample. Such probabilities are often 
referred to as probabilities a priori in that prior knowledge of 
the population probability is necessary for their evaluation. 

We now turn to what are termed probabilities a posteriori and 
we find that the position is the reverse of the a priori probabilities. 
Now all that is known is the composition of the sample and it is 
required from this knowledge to estimate the most probable 
composition of the population from which it has been drawn. 
Obviously if repeated sampling could be carried out then the 
composition of the parent population, i.e. the proportion of 
individuals possessing the characteristic A, can be estimated very 
nearly. However, it is not always possible for this repeated 
sampling to be carried out and we shall therefore discuss methods 
of estimating a population probability which have been put 
forward in the past (and mostly rejected), and the present-day 
method of confidence intervals. 

For many years statistical controversy was
centred on the theorem on probabilities a posteriori, loosely
spoken of as Bayes' theorem. Thomas Bayes himself may have
had doubts about the validity of the application of his theorem.
At any rate he withheld its publication and it was only after his
death that it was found among his papers and communicated to
the Royal Society by his friend Richard Price in 1763. Laplace
incorporated it in his Théorie Analytique des Probabilités and
used the theorem in a way which we cannot at this present time 
consider justifiable. However, since the theorem was given the 
weight of Laplace's authority, the validity of its application was 
assumed by many writers of the nineteenth century; we find 
famous statisticians such as Karl Pearson and Edgeworth de- 
fending it, and it still holds a prominent place in such elementary 
algebra text-books as have chapters on probability. Yet the 
modern point of view is that, strictly speaking, the application 
of the theorem in statistical method is wholly fallacious except 
under very restricted conditions. 

BAYES' THEOREM 

An event E may happen only if one of the set $E_1, E_2, \ldots, E_k$ of
mutually exclusive and only possible events occurs. The prob-
ability of the event $E_i$, given that E has occurred, is given by

$$P\{E_i \mid E\} = \frac{P\{E_i\}\,P\{E \mid E_i\}}{\sum_{j=1}^{k} P\{E_j\}\,P\{E \mid E_j\}}.$$

$P\{E_i \mid E\}$ is spoken of as the a posteriori probability of the event $E_i$.
Proof of Theorem. The proof of the theorem is simple.

$$P\{E . E_i\} = P\{E\}\,P\{E_i \mid E\} = P\{E_i\}\,P\{E \mid E_i\},$$

whence $P\{E_i \mid E\} = P\{E_i\}\,P\{E \mid E_i\}/P\{E\}$.

It is stated that E may only occur if one of the set $E_1, E_2, \ldots, E_k$
occurs, and since these mutually exclusive events are also the
only possible ones,

$$P\{E\} = P\{E . E_1\} + P\{E . E_2\} + \ldots + P\{E . E_k\}.$$

Each of the probabilities on the right-hand side may be expanded
as before, for example,

$$P\{E . E_1\} = P\{E_1\}\,P\{E \mid E_1\}$$

and the proof of the theorem follows.

In the statement of the theorem we have written of a set 
E l9 E 2 , ..., E k of mutually exclusive and only possible events. It 
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is, however, possible to speak of them as a set of hypotheses, 
and for this reason Bayes' theorem is sometimes referred to as 
a formula for the probability of hypotheses and sometimes as 
a theorem on the probability of causes. 

However, no matter what the title of the theorem, it is clear
that unless $P\{E_t\}$, i.e. the prior probability of the event $E_t$, is
known, the application of the formula cannot be valid. Thus, if
we regard the event $E_t$ as the hypothesis that the sample $E$
has been drawn from one of a given set of populations, it is
clear that the probability of this hypothesis will rarely be known.
If the composition of the super-population generating the set of
populations is known, or, in the language of the theorem, if
the prior probability of the event $E_t$ is known, then the validity
of the application of the theorem is not in question; but we must
then consider what occasion would arise in statistical practice
in which it is necessary to calculate a further probability.

If $P\{E_t\}$ is not known, it follows that some assumption regarding
its value must be made before $P\{E_t \mid E\}$ can be calculated. There
is no reason why this assumption should not be made, but the
fact which is often overlooked is that, given such an assumption,
$P\{E_t \mid E\}$ will only be correct under this assumption and will vary
according to the nature of the assumption. It has been customary
to assume that all compositions of the populations $E_1, E_2, \ldots, E_k$
are equally probable, and from this to draw inferences regarding
the most probable composition of the population from which the
sample has been drawn. It is legitimate to make this assumption,
but if it is made then it should be stated that under the assumption
that all population compositions (all hypotheses, all causes)
are equally likely, the probability that $E$ is associated with $E_t$
is a certain value.

Possibly it is unnecessary to labour this point further for at 
the present time there are few adherents of Bayes' theorem. We 
shall consider some examples for which the application of Bayes' 
theorem is valid, and some for which we shall show that the 
probability will vary according to the original hypothesis 
regarding the populations. 

Example. Assume that there are three urns each containing
a certain number of balls. The first urn contains 1 white, 2 red
and 3 black balls; the second 2 white, 3 red and 1 black; and the
third 3 white, 1 red and 2 black. The balls are indistinguishable
one from another, except for colour, and it may be assumed that
the probability of drawing one given ball from any urn is 1/6.
An urn is chosen at random and from it two balls are chosen at
random. These two balls are one red and one white. What is the
probability that they came from the second urn?

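If the urn is selected at random then all three urns are equally
probable and we have

$$P\{E_1\} = P\{\text{Urn 1}\} = \tfrac13, \quad P\{E_2\} = P\{\text{Urn 2}\} = \tfrac13, \quad P\{E_3\} = P\{\text{Urn 3}\} = \tfrac13.$$

The event $E$ consists of the drawing of two balls, 1 white and 1 red,
from either $E_1$ or $E_2$ or $E_3$.

$$P\{E \mid E_1\} = \frac{\text{number of ways in which 2 balls drawn from } E_1 \text{ can be 1 white and 1 red}}{\text{total number of ways in which 2 balls can be drawn from } E_1} = 2 \cdot \frac{2!\,4!}{6!} = \frac{2}{15}.$$

Similarly

$$P\{E \mid E_2\} = 6 \cdot \frac{2!\,4!}{6!} = \frac{2}{5}, \qquad P\{E \mid E_3\} = 3 \cdot \frac{2!\,4!}{6!} = \frac{1}{5}.$$

By Bayes' theorem

$$P\{E_t \mid E\} = \frac{P\{E_t\}\,P\{E \mid E_t\}}{\sum_{i=1}^{3} P\{E_i\}\,P\{E \mid E_i\}}$$

and we have

$$P\{E_2 \mid E\} = \frac{\tfrac13 \cdot \tfrac{6}{15}}{\tfrac13\left(\tfrac{2}{15} + \tfrac{6}{15} + \tfrac{3}{15}\right)} = \frac{6}{11}.$$
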

It is a little uncertain how such probabilities may be interpreted. 
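
The arithmetic of the urn example may be checked mechanically. What follows is a minimal sketch in Python, not part of the original text; the names `urns`, `prior` and `likelihood` are illustrative assumptions, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction
from math import comb

# Urn contents as (white, red, black), taken from the example.
urns = {1: (1, 2, 3), 2: (2, 3, 1), 3: (3, 1, 2)}

# Prior: the urn is chosen at random, so each has probability 1/3.
prior = {u: Fraction(1, 3) for u in urns}

def likelihood(white, red, black):
    """P{one white and one red | urn}, two balls drawn without replacement."""
    total = white + red + black
    return Fraction(white * red, comb(total, 2))

# Bayes' theorem: the posterior is proportional to prior times likelihood.
joint = {u: prior[u] * likelihood(*urns[u]) for u in urns}
evidence = sum(joint.values())
posterior = {u: joint[u] / evidence for u in urns}

for u in sorted(posterior):
    print(f"P{{Urn {u} | E}} = {posterior[u]}")
```

Here the prior probabilities are genuinely known from the statement of the problem, which is why the application of the theorem is not in question.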

Example. A box contains a very large number of identical 
balls one-half of which are coloured white and the rest black. 
From this population ten balls are chosen randomly and put in 
another box and the result of drawing 5 balls randomly with 
replacement from these ten is that 4 showed black and 1 white. 
What is the most probable composition of the ten balls? 

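The probability that there are $k$ white balls in the ten is, from
the description 'very large' population,

$$P\{E_k\} = \frac{10!}{k!\,(10-k)!}\left(\frac{1}{2}\right)^{10}.$$

If there are $k$ white balls in the population of ten then the probability
that 5 balls drawn from the ten with replacement will show 4
black and 1 white will be

$$P\{E \mid E_k\} = \frac{5!}{4!\,1!}\left(\frac{k}{10}\right)\left(1 - \frac{k}{10}\right)^{4}.$$

Applying Bayes' theorem and reducing, we have

$$P\{E_k \mid E\} = \frac{10!}{k!\,(10-k)!}\left(\frac{k}{10}\right)\left(1 - \frac{k}{10}\right)^{4} \Bigg/ \sum_{j=1}^{9} \frac{10!}{j!\,(10-j)!}\left(\frac{j}{10}\right)\left(1 - \frac{j}{10}\right)^{4}.$$
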



Letting $k$ take in turn values 1, 2, 3, ..., 7, 8, 9 (we exclude zero
because it is known that one white ball is among the ten, and
10 because at least one black ball is known to be present), we
may draw up the following table:



    k              1     2     3     4     5     6     7     8     9
    P{E_k | E}   0.02  0.11  0.24  0.30  0.22  0.09  0.02  0.00  0.00



The most probable composition of the ten balls is therefore four 
white and six black. 
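
The table may be recomputed directly from the prior and the likelihood. The following is a sketch only, not part of the original text; the constants and function names are assumptions introduced for illustration, and the last printed digit may differ from the table by rounding.

```python
from math import comb

N_BALLS = 10     # balls placed in the second box
DRAWS = 5        # draws made with replacement
WHITE_SEEN = 1   # white balls observed among the draws

def prior(k):
    """P{k white among the ten} when each ball is white with probability 1/2."""
    return comb(N_BALLS, k) * 0.5 ** N_BALLS

def likelihood(k):
    """P{1 white and 4 black in 5 draws with replacement | k white in the ten}."""
    p_white = k / N_BALLS
    return (comb(DRAWS, WHITE_SEEN)
            * p_white ** WHITE_SEEN
            * (1 - p_white) ** (DRAWS - WHITE_SEEN))

ks = range(1, N_BALLS)  # k = 1, ..., 9; 0 and 10 are excluded by the data
weights = {k: prior(k) * likelihood(k) for k in ks}
total = sum(weights.values())
posterior = {k: w / total for k, w in weights.items()}

# The composition with the greatest a posteriori probability.
most_probable = max(posterior, key=posterior.get)

for k in ks:
    print(k, round(posterior[k], 2))
```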

Coolidge gives an interesting illustration of the effect of two 
different hypotheses in problems of this type where the composi- 
tion of the original population is not known. He propounds the 
following problem: 

'An urn contains N identical balls, black and white, in
unknown proportion. A ball is drawn out and replaced n times,
the balls being mixed after each drawing, with the result that
just r white balls are seen. What is the probability that the urn
contains exactly R white balls?' As in the previous problem

$$P\{E \mid E_R\} = \frac{n!}{r!\,(n-r)!}\left(\frac{R}{N}\right)^{r}\left(1 - \frac{R}{N}\right)^{n-r},$$

but now the probabilities of the compositions of the population 
are not known. We cannot therefore apply Bayes' theorem unless 
we make some hypothesis about these probabilities. Coolidge 
suggests 

Hypothesis I. All compositions of the population are equally
likely.
Hypothesis II. The population has been formed by drawing
balls at random from a super-population in which black and
white are of equal proportions.

The student may invent other hypotheses for himself. 

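For hypothesis I we are given

$$P\{E_R\} = \frac{1}{N-1},$$

for we rule out the cases that all are white and that all are black.
We have therefore

$$P\{E_R \mid E\} = \left(\frac{R}{N}\right)^{r}\left(1 - \frac{R}{N}\right)^{n-r} \Bigg/ \sum_{R=1}^{N-1}\left(\frac{R}{N}\right)^{r}\left(1 - \frac{R}{N}\right)^{n-r},$$

and the most probable composition of the population, given
hypothesis I, is

$$\frac{R}{N} = \frac{r}{n}.$$

In other words, the proportion observed in the sample is the most
probable proportion in the population.
For hypothesis II we are given

$$P\{E_R\} = \frac{N!}{R!\,(N-R)!}\left(\frac{1}{2}\right)^{N},$$

which gives

$$P\{E_R \mid E\} = \frac{N!}{R!\,(N-R)!}\left(\frac{R}{N}\right)^{r}\left(1 - \frac{R}{N}\right)^{n-r} \Bigg/ \sum_{R=1}^{N-1}\frac{N!}{R!\,(N-R)!}\left(\frac{R}{N}\right)^{r}\left(1 - \frac{R}{N}\right)^{n-r},$$

whence by finding the value of $R/N$ which maximizes this
expression we deduce that the most probable composition of
the $N$ balls from which the $n$ balls were drawn with replacement,
is

$$\frac{R}{N} = \frac{1}{2}\cdot\frac{N + 2r}{N + n}.$$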

Thus by making two different assumptions regarding the
probabilities of the compositions of the population
we are led to two different conclusions. If we take the population
as $N = 10$, the sample as $n = 5$ and the number of white in the
sample as $r = 1$ we shall have for the most probable composition
of the population

Hypothesis I: $R/N = 1/5$. Hypothesis II: $R/N = 2/5$.

Both these results are correct if the original hypotheses are
accepted but neither is correct if we limit ourselves strictly to
the conditions of the problem which states that the composition
of the population is not known.
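
The dependence of the conclusion on the assumed prior can be seen by running the same calculation under each hypothesis. The sketch below is not from the original text; `posterior_mode`, `uniform` and `binomial_half` are hypothetical names, and the search is simply over all admissible values of R.

```python
from math import comb

def posterior_mode(N, n, r, prior):
    """Return the R in 1..N-1 that maximises prior(R) * P{r white in n draws | R}."""
    def weight(R):
        x = R / N
        return prior(R) * comb(n, r) * x ** r * (1 - x) ** (n - r)
    return max(range(1, N), key=weight)

N, n, r = 10, 5, 1

def uniform(R):
    """Hypothesis I: every composition with 1 <= R <= N-1 is equally likely."""
    return 1 / (N - 1)

def binomial_half(R):
    """Hypothesis II: the N balls were drawn from a half-white super-population."""
    return comb(N, R) * 0.5 ** N

mode_I = posterior_mode(N, n, r, uniform)
mode_II = posterior_mode(N, n, r, binomial_half)
print("Hypothesis I :", mode_I, "white out of", N)
print("Hypothesis II:", mode_II, "white out of", N)
```

The data are identical in the two calls; only the function passed as `prior` differs.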

This example illustrates clearly the fallaciousness of Bayes' 
theorem as it is generally applied. We shall take the point of view 
that in general Bayes' theorem will not be applicable in statistical
work. An exception to this, which will be discussed in a later 
chapter, is the application to Mendelian hypotheses. 

The statistical problem associated with Bayes' theorem is one 
which touches statisticians very nearly. Much of the working 
career of a present-day mathematical statistician is spent in 
attempting to draw valid inferences about a population when the 
only information available to him is that obtainable from a 
sample (or samples) drawn from that population. This need to 
draw inferences about the whole from the part has possibly 
always been felt by workers in probability, though not so 
markedly as to-day, and this may be the reason why the use of 
Bayes' theorem persisted for so many years; its inadequacy 
must have been recognized many times but it was used because 
no one could think of anything better. 

The objective of the method of confidence intervals, a statis- 
tical concept which was devised to overcome the impasse created 
by the too liberal use of Bayes' theorem, is the estimation of 
limits within which we may be reasonably sure that a given 
population parameter will lie; these limits are estimated from the 
information provided by the sample. For example, since we have
discussed the binomial theorem in some detail let us suppose that
it is desired to estimate $p$, the proportion of individuals possessing
a certain character in a given population, the only information
at our disposal being the number $f$ who possess that character
in a sample of size $n$ which has been randomly and independently
drawn from the population. We should, possibly, for lack of 
anything better, take the proportion in the sample as an estimate 
of the population probability, but it is unnecessary to point out 
that this estimate will vary both according to the size of the 
sample and the number of samples available. 

If $p$, the population probability, is known, then by the binomial
theorem the probabilities of obtaining 0, 1, 2, ..., n units
possessing the given characteristic in a sample of size $n$ can be
enumerated. If any positive fraction $\epsilon$ is arbitrarily chosen, where
$0 < \epsilon < 1$, two points $f_1/n$ and $f_2/n$ can be found such that

$$P\{x < f_1/n \mid p\} \le \tfrac12\epsilon \quad \text{and} \quad P\{x > f_2/n \mid p\} \le \tfrac12\epsilon,$$

whence

$$P\{f_1/n \le x \le f_2/n \mid p\} \ge 1 - \epsilon$$

for given values of $p$ and $n$. If $P\{x \mid p\}$ had been a continuous
function then it would have been possible to choose $f_1$ and $f_2$ so
that each of the above probabilities was exactly equal to $\tfrac12\epsilon$.
Since, however, for the binomial probabilities $P\{x \mid p\}$ is
discontinuous, it becomes necessary to choose $f$ as the nearest
integer satisfying the inequality.

If $n$ is kept constant but different values are taken for
$p$, $0 < p < 1$, it will be possible to find $f_1$ and $f_2$ to satisfy the above
inequalities for each value of $p$ chosen. It will therefore be
possible, for one given value of $n$ and one given value of $\epsilon$, to draw
a diagram something like this:



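[Figure: confidence belt for one value of n and one value of ε, drawn as a series of steps. Horizontal axis: scale of x = f/n, from 0 to 1.0; the vertical axis also runs from 0 to 1.0.]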



$\epsilon$ is usually arbitrarily chosen to be equal to 0.05 or 0.01 and it
follows from the construction of the diagram that knowing $p$,
the population probability, and $n$, the size of the sample, we may
read off the two fractions $f_1/n$ and $f_2/n$ and be confident that only
once in twenty times ($\epsilon = 0.05$) or once in one hundred times
($\epsilon = 0.01$) would we expect the probability as estimated from the
sample to fall by chance outside these limits. The curves will be
different for different values of $n$ and $\epsilon$ but they will all follow the
same kind of pattern.
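
For a single value of p the two limits can be found by accumulating binomial probabilities from each tail. The following sketch is offered under stated assumptions rather than as the method used for the printed charts: it takes $f_1$ as the largest integer with $P\{f < f_1\} \le \tfrac12\epsilon$ and $f_2$ as the smallest integer with $P\{f > f_2\} \le \tfrac12\epsilon$, and the function names are hypothetical.

```python
from math import comb

def binom_pmf(f, n, p):
    """P{exactly f units in a sample of n} for population probability p."""
    return comb(n, f) * p ** f * (1 - p) ** (n - f)

def belt_limits(n, p, eps):
    """Return (f1, f2) with P{f < f1} <= eps/2 and P{f > f2} <= eps/2."""
    half = eps / 2
    # Lower limit: move f1 upward while the probability below it stays within eps/2.
    f1, below = 0, 0.0
    while f1 < n and below + binom_pmf(f1, n, p) <= half:
        below += binom_pmf(f1, n, p)
        f1 += 1
    # Upper limit: move f2 downward while the probability above it stays within eps/2.
    f2, above = n, 0.0
    while f2 > 0 and above + binom_pmf(f2, n, p) <= half:
        above += binom_pmf(f2, n, p)
        f2 -= 1
    return f1, f2

def coverage(n, p, f1, f2):
    """P{f1 <= f <= f2} for the given n and p."""
    return sum(binom_pmf(f, n, p) for f in range(f1, f2 + 1))

f1, f2 = belt_limits(20, 0.5, 0.05)
print(f1 / 20, f2 / 20, coverage(20, 0.5, f1, f2))
```

Repeating the call over a grid of values of p, with n and ε fixed, traces out the stepped belt described above. Limits computed exactly in this way need not coincide with values read from a smoothed chart.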

Diagrams somewhat similar to these were first drawn by
E. S. Pearson and C. J. Clopper. Their curves are reproduced
here by permission of the Editor of Biometrika. Although
strictly they should have been drawn as a series of steps, as in
the illustration above, the authors smoothed them by joining
the outer points of the steps by a smooth curve.

[Figure: Confidence belts for p (confidence coefficient = 0.95). Horizontal axis: scale of f/n.]

[Figure: Confidence belts for p (confidence coefficient = 0.90). Horizontal axis: scale of f/n.]

Before we pass on to consider the estimation of an interval 
for an unknown population parameter from these curves, we 
may perhaps mention one use of the curves which is sometimes 
overlooked. Suppose it is desired to test the hypothesis that 
a sample, in which the observed proportion of units possessing 
a given character is f/n, had come from a population in which 
the proportion was p. This type of problem may arise in the 
testing of Mendelian hypotheses when it is possible to calculate 
a priori what the proportion in the population should be. 

Example. From the mating of two individuals of genetical 
compositions AA and Aa the offspring must have the only 
possible genetical compositions AA and Aa by the Mendelian 
hypothesis. Among the 20 offspring from several matings of 
this type 14 were observed to be AA and 6 Aa. Is this consistent
with the Mendelian hypothesis? By the Mendelian hypothesis
(see next chapter),

$$P\{AA\} = \tfrac12 = P\{Aa\}.$$

The observed proportion of AA in a sample of 20 was 14/20 = 0.7.
Reference to the confidence belt for $n = 20$, $\epsilon = 0.05$ at the point
$p = \tfrac12$ shows that once in twenty times, owing to random errors,
the sample proportion will lie outside the limits 0.25 and 0.75.
In our case the observed proportion lies within these limits and 
we may say that there is nothing in the data to contradict the 
Mendelian hypothesis. 
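
The same conclusion can be reached without the chart by computing the binomial tail directly. This is a sketch under the assumption that the observation is judged against the upper $\tfrac12\epsilon$ tail of the binomial with $n = 20$ and $p = \tfrac12$; the names are illustrative, and exact tail values need not agree with limits read from the smoothed curves.

```python
from math import comb

def upper_tail(f, n, p):
    """P{number in sample >= f} under the binomial with parameters n and p."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(f, n + 1))

n, f, p, eps = 20, 14, 0.5, 0.05

tail = upper_tail(f, n, p)       # chance of 14 or more AA among 20 offspring
consistent = tail > eps / 2      # True if the observation is not in the extreme tail

print(f"P{{f >= {f}}} = {tail:.4f}; consistent with hypothesis: {consistent}")
```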

The procedure carried through in the example does not differ 
from that which we have previously discussed in other examples 
on the binomial theorem, except that in this case the limits are 
already calculated. If the sample proportion had happened to 
fall outside the sample limits as given by the chart, then the 
argument would be that only once in twenty times would this 
be expected to happen through chance, that twenty to one are 
rather long odds and that the original Mendelian hypothesis may 
not be tenable. At least the statistician would be justified in 
asking for a check of the original assumptions. 

In using the 0.05 or 0.01 level for the acceptance or rejection
of hypotheses it must be remembered that on an average of once
in twenty times (or once in one hundred times) the hypothesis
tested will be correct but that the arbitrarily chosen limits will
reject it as untenable. Obviously it will not be possible for the
statistician to predict when such an occasion will arise and it will
be necessary for him therefore to balance the sensitivity of the
limits against the risk of making a false decision. In a lifetime of
statistical work he will run the risk of averaging one in twenty
wrong decisions if he always chooses the 0.05 level as his criterion.

This use of the confidence curves for the testing of hypotheses
is not all-important, however, because the population probability
is rarely known; in fact where Bayes' theorem is not directly
applicable for purposes of estimation then no direct hypothesis
can be tested. The use to which the confidence curves are most
often put is to estimate an interval which will include the unknown
population parameter 95 (or 99) times in 100. The confidence
belts were constructed by considering all values of $p$ between
0 and 1 and all possible values of $x$ ($= f/n$) for each value of $p$.
Consider a function $\phi(p)$ which will have the property that the
value of $\phi(p)$ at the point $p = a$ will be the probability that
$p = a$. This function $\phi(p)$ we may call the a priori elementary
probability law of $p$ and we note that it is unknown.

The probability that any given point $(x, p)$ will lie within the
confidence belt, for a sample of size $n$, is

$$P\{(x,p) \mid n\} = \text{Probability that } (x,p) \text{ lies within the confidence belt for } n = \sum_{\text{All } p} \phi(p) \sum_{x \ge f_1/n}^{x \le f_2/n} P\{x \mid p\}.$$

Owing to the way in which the confidence belts were constructed
we shall have

$$P\{(x,p) \mid n\} \ge (1 - \epsilon) \sum_{\text{All } p} \phi(p) = 1 - \epsilon.$$

Hence the probability that any pair of values $(x, p)$ will lie
within the confidence belt is greater than or equal to $1 - \epsilon$, $\epsilon$ being
the small positive fraction at choice. It follows that if we make
the statement that the pair of values $(x, p)$ will always lie within
the confidence belt we shall be wrong in making this statement
on not more than a proportion $\epsilon$ of occasions.




Suppose now that we have an observed proportion $x = f/n$
and it is desired to estimate confidence limits for the unknown
population probability, $p$. This may be done directly from the
confidence belt. The abscissa is $x$ and if an ordinate is drawn
through $x$ cutting the confidence belt at $p_1$ and $p_2$, then

$$p_1 \le p \le p_2,$$

and $p_1$ and $p_2$ are called the confidence limits for $p$. Because $p$
is a population parameter it cannot vary, but the limits $p_1$ and $p_2$,
dependent as they are on $n$ and $f$, will vary from sample to sample;
but however $p_1$ and $p_2$ may vary, in stating that the interval
$p_1 \le p \le p_2$ will cover the true population value we shall be right in
making this statement on a proportion $(1 - \epsilon)$ of occasions. The
value $\epsilon$ is at choice and is usually taken as 0.05 or 0.01. The
statistician must balance the smaller interval for $p$ if $\epsilon$ is large
against the increased chance that the interval will not cover the
true value.
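
Reading the belt along an ordinate amounts to solving two tail equations for p. The sketch below is one way of doing this numerically and is not taken from the original text: it assumes the limits are defined by $P\{f' \ge f \mid p_1\} = \tfrac12\epsilon$ and $P\{f' \le f \mid p_2\} = \tfrac12\epsilon$, solves each by simple bisection, and uses hypothetical function names throughout.

```python
from math import comb

def cdf(f, n, p):
    """P{X <= f} for the binomial with parameters n and p."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(0, f + 1))

def bisect(func, lo, hi, iters=60):
    """Approximate a root of func on [lo, hi], assuming one sign change."""
    sign_lo = func(lo) > 0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (func(mid) > 0) == sign_lo:
            lo = mid   # root lies in the upper half
        else:
            hi = mid   # root lies in the lower half
    return (lo + hi) / 2

def confidence_limits(f, n, eps=0.05):
    """Exact binomial limits (p1, p2) for the population proportion."""
    half = eps / 2
    # p1 solves P{X >= f | p1} = eps/2; taken as 0 when f = 0.
    p1 = 0.0 if f == 0 else bisect(lambda p: (1 - cdf(f - 1, n, p)) - half, 0.0, 1.0)
    # p2 solves P{X <= f | p2} = eps/2; taken as 1 when f = n.
    p2 = 1.0 if f == n else bisect(lambda p: cdf(f, n, p) - half, 0.0, 1.0)
    return p1, p2

p1, p2 = confidence_limits(14, 20, eps=0.05)
print(f"{p1:.3f} <= p <= {p2:.3f}")
```

A larger ε gives a shorter interval from the same sample, at the price of a greater chance that the interval fails to cover the true value.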

We may note that because of the method of construction of
the confidence belt the a priori distribution of p does not matter.
Thus we have taken a step away from the restrictions of Bayes'
theorem.

would like to read a defence of the theorem there is H. Jeffreys, The
Theory of Probability, but it should be added that few statisticians accept 
Jeffreys' arguments. J. L. Coolidge, An Introduction to Mathematical 
Probability has a stimulating discussion of Bayes' theorem as has also 
J. V. Uspensky, Introduction to Mathematical Probability. R. A. Fisher, 
'Uncertain Inference', Proc. American Academy of Arts and Sciences, 
LXXI, no. 4, gives interesting criticisms of Bayes' theorem and develops 
his own theory of fiducial inference published some years previously. 

I have not touched on Fisher's theory in this chapter. We may note 
that he develops a theory which differs to a certain extent from that of 
Neyman and that both theories have their protagonists. Neyman first 
put forward (in English) his theory of confidence intervals in J. Neyman, 
'On two different aspects of the representative method', J. R. Statist. 
Soc. 1934, and extended it later in 'Outline of a theory of statistical 
estimation based on the classical theory of probability', Phil. Trans. A, 
ccxxxvi, p. 333. 

It is for the student to read both Fisher and Neyman and to make
up his own mind which theory he prefers to adopt.

CHAPTER VIII
SIMPLE GENETICAL APPLICATIONS

It is perhaps surprising that the field of genetics has not made 
a more universal appeal to writers on probability. The hypotheses 
governing the simpler aspects of inheritance appear to be clear- 
cut and it is intellectually more satisfying to apply the funda- 
mentals of probability to a subject which is of practical im- 
portance rather than to follow the orthodox procedure and 
discuss the hazards of the gaming tables. There are many 
text-books on genetics in which probability applications are set 
out, and this present chapter does not pretend to instruct the 
serious student of genetics; what is attempted is to give the 
student of probability a small idea of how the elementary 
theorems of probability may be of use. 

The simple Mendelian laws of inheritance postulate the 
hypothesis that there are 'atoms' of heredity known as genes. 
These genes are associated in pairs and an offspring from the 
mating of two individuals receives one gene from the pair from 
each parent. Thus if we write AA for a pair of dominant genes,
and aa for a pair of recessive genes, the genetical composition
of the offspring of the mating of AA × aa can only be Aa. Such
a genetical composition will be spoken of as a hybrid. From such
simple assumptions it is possible to specify the probabilities of
any type of genetical composition arising in the offspring of any
given mating. If we write $X_1$ and $X_2$ for the genetical composition
of the parents and $Y$ for that of their offspring, we shall have,
considering one pair of genes only, the following alternatives.

(i) $X_1 = AA$, $X_2 = AA$.
    $P\{Y = AA \mid X_1 = AA, X_2 = AA\} = 1$,
    $P\{Y = Aa \mid X_1 = AA, X_2 = AA\} = 0$,
    $P\{Y = aa \mid X_1 = AA, X_2 = AA\} = 0$.

(ii) $X_1 = Aa$, $X_2 = AA$.
    $P\{Y = AA \mid X_1 = Aa, X_2 = AA\} = \tfrac12$,
    $P\{Y = Aa \mid X_1 = Aa, X_2 = AA\} = \tfrac12$,
    $P\{Y = aa \mid X_1 = Aa, X_2 = AA\} = 0$.

Similarly for $X_1 = AA$ and $X_2 = Aa$.

(iii) $X_1 = aa$, $X_2 = AA$.
    $P\{Y = AA \mid X_1 = aa, X_2 = AA\} = 0$,
    $P\{Y = Aa \mid X_1 = aa, X_2 = AA\} = 1$,
    $P\{Y = aa \mid X_1 = aa, X_2 = AA\} = 0$.

Similarly for $X_1 = AA$ and $X_2 = aa$.

(iv) $X_1 = Aa$, $X_2 = Aa$.
    $P\{Y = AA \mid X_1 = Aa, X_2 = Aa\} = \tfrac14$,
    $P\{Y = Aa \mid X_1 = Aa, X_2 = Aa\} = \tfrac12$,
    $P\{Y = aa \mid X_1 = Aa, X_2 = Aa\} = \tfrac14$.

(v) $X_1 = aa$, $X_2 = Aa$.
    $P\{Y = AA \mid X_1 = aa, X_2 = Aa\} = 0$,
    $P\{Y = Aa \mid X_1 = aa, X_2 = Aa\} = \tfrac12$,
    $P\{Y = aa \mid X_1 = aa, X_2 = Aa\} = \tfrac12$.

Similarly for $X_1 = Aa$ and $X_2 = aa$.

(vi) $X_1 = aa$, $X_2 = aa$.
    $P\{Y = AA \mid X_1 = aa, X_2 = aa\} = 0$,
    $P\{Y = Aa \mid X_1 = aa, X_2 = aa\} = 0$,
    $P\{Y = aa \mid X_1 = aa, X_2 = aa\} = 1$.

These results follow directly from the application of elementary 
probability theorems. The probabilities thus obtained are some- 
times spoken of as the Mendelian ratios. 
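
The six cases can be generated from the single rule that each parent hands on one of its two genes with probability one-half. The sketch below is illustrative only and not part of the original text; the function name and the representation of a genetical composition as a two-letter string are assumptions.

```python
from fractions import Fraction
from itertools import product

def offspring_distribution(x1, x2):
    """Probabilities of AA, Aa and aa among offspring of parents x1 and x2.

    Each parent is a two-letter string ("AA", "Aa" or "aa") and passes on
    either of its two genes with probability 1/2.
    """
    dist = {"AA": Fraction(0), "Aa": Fraction(0), "aa": Fraction(0)}
    for g1, g2 in product(x1, x2):
        # Sorting puts "A" before "a", so "aA" and "Aa" count as one hybrid.
        child = "".join(sorted(g1 + g2))
        dist[child] += Fraction(1, 4)
    return dist

for parents in [("AA", "AA"), ("Aa", "AA"), ("aa", "AA"),
                ("Aa", "Aa"), ("aa", "Aa"), ("aa", "aa")]:
    print(parents, offspring_distribution(*parents))
```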

The study of the inheritance of a particular pair of genes in 
a population is often rendered difficult by the fact that there is 
a selective factor in mating of which it is necessary to take 
account. Karl Pearson discussed this 'coefficient of assortative
mating' for human populations and there is no doubt that it
obtains for many animal populations also. In fact, it is difficult 
to think of any population in which it is reasonably certain that 
the mating is at random and is not affected by the genetical 
composition of the parents. The process of random mating is 
styled Panmixia, and we shall discuss a simplified form of this 
process. It may be questioned, since random mating is rarely 
met with in practice, whether it is worth while discussing. From 
the point of view of applying the theory of probability to genetical 
material it possibly is not, but from the point of view of under- 
standing the application of probability theory to genetical 
theory the study of Panmixia will not be without value. 

Assume that in a given population the proportions of males
who are dominants (AA), hybrids (Aa) and recessives (aa) are
$p_1$, $q_1$ and $r_1$, where $p_1 + q_1 + r_1 = 1$, and that the corresponding
proportions for females are $p_2$, $q_2$ and $r_2$, where $p_2 + q_2 + r_2 = 1$.
If it is further assumed that Panmixia operates, we may proceed
to calculate the proportions of dominants, hybrids and recessives
in the first, second and third filial generations. As before, write
$X_1$ and $X_2$ to represent the genetical composition of the parents
and $Y$ for the genetical composition of their offspring. It follows
then that the proportion of dominants in the first filial generation
is given by

$$\begin{aligned}
P\{Y = AA\} &= P\{(X_1 = AA)(X_2 = AA)(Y = AA)\} \\
&+ P\{(X_1 = AA)(X_2 = Aa)(Y = AA)\} \\
&+ P\{(X_1 = AA)(X_2 = aa)(Y = AA)\} \\
&+ P\{(X_1 = Aa)(X_2 = AA)(Y = AA)\} \\
&+ P\{(X_1 = Aa)(X_2 = Aa)(Y = AA)\} \\
&+ P\{(X_1 = Aa)(X_2 = aa)(Y = AA)\} \\
&+ P\{(X_1 = aa)(X_2 = AA)(Y = AA)\} \\
&+ P\{(X_1 = aa)(X_2 = Aa)(Y = AA)\} \\
&+ P\{(X_1 = aa)(X_2 = aa)(Y = AA)\}.
\end{aligned}$$

Each of these probabilities may be evaluated from first principles. 
For example 

$$P\{(X_1 = AA)(X_2 = AA)(Y = AA)\} = P\{X_1 = AA\}\,P\{X_2 = AA \mid X_1 = AA\} \times P\{Y = AA \mid (X_1 = AA)(X_2 = AA)\},$$

or, since random mating was assumed,

$$P\{(X_1 = AA)(X_2 = AA)(Y = AA)\} = P\{X_1 = AA\}\,P\{X_2 = AA\} \times P\{Y = AA \mid (X_1 = AA)(X_2 = AA)\},$$

whence, on substitution, we have

$$P\{(X_1 = AA)(X_2 = AA)(Y = AA)\} = p_1 p_2.$$

If $P_1$ be the total probability that $Y$ is a dominant, then, by
similar calculations for each individual term and substitution in
the formula, it is found that

$$P\{Y = AA\} = P_1 = (p_1 + \tfrac12 q_1)(p_2 + \tfrac12 q_2).$$

Similarly it may be shown that

$$P\{Y = Aa\} = Q_1 = (p_1 + \tfrac12 q_1)(r_2 + \tfrac12 q_2) + (p_2 + \tfrac12 q_2)(r_1 + \tfrac12 q_1)$$

and

$$P\{Y = aa\} = R_1 = (r_1 + \tfrac12 q_1)(r_2 + \tfrac12 q_2).$$

It is easy to verify that $P_1 + Q_1 + R_1 = 1$. The proportions of
dominants, hybrids and recessives in the first filial generation
are therefore $P_1$, $Q_1$, $R_1$.

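We now assume that random mating again occurs, and
calculate the proportions of dominants, hybrids and recessives
in the second filial generation $Z$. By a process identical with that
for the first filial generation it may be shown that

$$P_2 = P\{Z = AA\} = (P_1 + \tfrac12 Q_1)^2,$$
$$Q_2 = P\{Z = Aa\} = 2(P_1 + \tfrac12 Q_1)(R_1 + \tfrac12 Q_1),$$
$$R_2 = P\{Z = aa\} = (R_1 + \tfrac12 Q_1)^2.$$

Again if $W$ is the third filial generation then

$$P_3 = P\{W = AA\} = (P_2 + \tfrac12 Q_2)^2 = (P_1 + \tfrac12 Q_1)^2 = P_2,$$
$$Q_3 = P\{W = Aa\} = 2(P_2 + \tfrac12 Q_2)(R_2 + \tfrac12 Q_2) = Q_2,$$
$$R_3 = P\{W = aa\} = (R_2 + \tfrac12 Q_2)^2 = (R_1 + \tfrac12 Q_1)^2 = R_2,$$

remembering that $P_1 + Q_1 + R_1 = 1$. It follows, then, that provided
the mating is always at random and no extraneous factors
intervene, the genetical compositions of the population do not
change in proportion after the first filial generation. The
population after the first filial generation may be regarded
therefore as stable genetically.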
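
The stability of the proportions can be checked numerically for any starting population. What follows is a sketch, not part of the original text; the starting proportions are hypothetical values chosen only for illustration, and `next_generation` is an assumed name.

```python
from fractions import Fraction

def next_generation(male, female):
    """Offspring proportions (AA, Aa, aa) under random mating.

    male and female are (p, q, r) triples of dominants, hybrids and
    recessives for the two sexes.
    """
    p1, q1, r1 = male
    p2, q2, r2 = female
    a1, a2 = p1 + q1 / 2, p2 + q2 / 2   # chance that each parent hands on gene A
    b1, b2 = r1 + q1 / 2, r2 + q2 / 2   # chance that each parent hands on gene a
    return (a1 * a2, a1 * b2 + a2 * b1, b1 * b2)

# Hypothetical starting proportions, differing between the sexes.
males = (Fraction(1, 2), Fraction(1, 4), Fraction(1, 4))
females = (Fraction(1, 10), Fraction(3, 10), Fraction(3, 5))

first = next_generation(males, females)
second = next_generation(first, first)    # both sexes now share the same proportions
third = next_generation(second, second)

print("first :", first)
print("second:", second)
print("third :", third)
```

With these starting values the first and second generations differ, while the second and third agree exactly.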

Example. A breeder wishes to produce seeds of red flowering 
plants (AA or Aa). For this purpose he repeatedly performs a 
mass selection, consisting in the early removal from his fields 
of all plants with white flowers (aa) before they open. Thus he 
removes the possibility of any plants being fertilized by the pollen 
of pure recessives. Assume that the process of reproduction of 
plants left untouched in the field satisfies the definition of 
Panmixia and that in a particular year the proportion of plants
removed because they would have flowered white was r = 4/25.
Calculate the proportion of white flowers to be expected from the
seeds of the plants left growing on the field.

Since repeated selection has been performed it may be assumed
that the population is genetically stable, so that,
if $p$, $q$ and $r$ are the proportions of dominants, hybrids and
recessives respectively,

$$p = (p + \tfrac12 q)^2, \quad q = 2(p + \tfrac12 q)(r + \tfrac12 q), \quad r = (r + \tfrac12 q)^2,$$

which give, on solution of the equations,

$$p = (1 - \sqrt{r})^2, \quad q = 2\sqrt{r}\,(1 - \sqrt{r}),$$

for the proportion of dominants, hybrids and recessives in a stable
population. The recessives are artificially removed, i.e. $r$ becomes 0,
and the proportions accordingly become

$$p' = \frac{1 - \sqrt{r}}{1 + \sqrt{r}}, \quad q' = \frac{2\sqrt{r}}{1 + \sqrt{r}}, \quad r' = 0.$$

If this population now mates according to Panmixia the
proportion of recessives, $r_1$, in the next generation from a population
so composed will be

$$r_1 = (r' + \tfrac12 q')^2 = \frac{r}{(1 + \sqrt{r})^2}.$$

$r$ was given equal to 4/25, and therefore the proportion of white
flowers (recessives), given by substituting in the expression for $r_1$,
is $r_1 = 4/49$, that is, the proportion of white flowers has been
almost halved by a single selective process.
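
The breeder's single round of selection can be followed step by step in exact arithmetic. This is a minimal sketch and not part of the original text; the two function names are assumptions, and the stable starting proportions are those implied by r = 4/25.

```python
from fractions import Fraction

def remove_recessives(p, q, r):
    """Proportions after every aa plant has been removed from the field."""
    return p / (1 - r), q / (1 - r), Fraction(0)

def random_mating(p, q, r):
    """Proportions in the next generation under Panmixia."""
    a = p + q / 2   # chance that a parent hands on gene A
    b = r + q / 2   # chance that a parent hands on gene a
    return a * a, 2 * a * b, b * b

# Stable population with r = 4/25, so that sqrt(r) = 2/5:
# p = (1 - 2/5)^2 = 9/25 and q = 2 * (2/5) * (3/5) = 12/25.
stable = (Fraction(9, 25), Fraction(12, 25), Fraction(4, 25))

selected = remove_recessives(*stable)
p1, q1, r1 = random_mating(*selected)
print("after selection:", selected)
print("next generation:", (p1, q1, r1))
```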

The procedure set out in this example of artificially destroying 
a certain proportion of the population raises some interesting 
queries as to what will happen if the selection is carried out 
a number of times, and what the number of repetitions will need 
to be if the proportion of recessives is to fall below a given 
number. The whole problem of random mating and artificial 
selection can be represented geometrically. 

We have seen that in a genetically stable population

$$p = (1 - \sqrt{r})^2,$$

from which $\sqrt{p} + \sqrt{r} = 1$,

which is a parabola. Also since $p + q + r = 1$, $p + r \le 1$. The
composition of a genetically stable population is therefore
represented by that part of the parabola lying between the
points (1, 0) and (0, 1).

Consider a point $A_0$ with co-ordinates $(p_0, r_0)$. If $p_0$ and $r_0$ represent
the proportion of dominants and recessives in a population, then
for Panmixia, as already proved, the proportions $p_1$ and $r_1$ of
dominants and recessives in the first filial generation will be

$$p_1 = (p_0 + \tfrac12 q_0)^2 = \tfrac14(1 + p_0 - r_0)^2, \quad r_1 = (r_0 + \tfrac12 q_0)^2 = \tfrac14(1 - p_0 + r_0)^2.$$

[Figure: the parabola between the points (0, 1) and (1, 0), with abscissa p. (Not to scale)]

If $p_0$ and $r_0$ are constant, then the point with co-ordinates
$(p_1, r_1)$ will be given by the intersection with the parabola of
the perpendicular to the line

$$p + r = 1$$

through $A_0$, for $p_1 - r_1 = p_0 - r_0$ and $\sqrt{p_1} + \sqrt{r_1} = 1$.
If $(p_0, r_0)$ is a point on the parabola then $(p_1, r_1)$ is the
same point. Hence, if the mating in a population is supposed to
be according to Panmixia the composition of the next generation
may easily be found by geometrical drawing.

We may now suppose that we have a population the composition
of which is genetically stable, and the co-ordinates of which
on the parabola are $(p, r)$. If all the recessives in this population
are destroyed the proportion of dominants will be $p/(1 - r)$ and the
composition of the population will be represented by a point $A'_0$
with co-ordinates $(p/(1 - r), 0)$. It will be noted that the point $A'_0$
is also the point of intersection with the abscissa of the line
joining the points (0, 1) and $A_0$, $(p, r)$.

If the population represented by $A'_0$ is allowed to mate
randomly, then, following the previous analysis, the composition
of the next generation will be given by the co-ordinates of the
point of intersection with the parabola of a line at right angles to

$$p + r = 1$$

and passing through $A'_0$. The co-ordinates of this point $A_1$ are

$$\left[\frac{1}{(1 + \sqrt{r})^2},\ \frac{r}{(1 + \sqrt{r})^2}\right].$$

The effect of applying selection to the population represented by
$A_1$, and then allowing it to mate randomly, can be seen by
following the same process; $A_1$ is joined to (0, 1) and the point of
intersection of this line with the abscissa is $A'_1$. The line through
$A'_1$ at right angles to the line joining (1, 0) and (0, 1) gives the
composition of the next generation in its point of intersection
with the parabola. Suppose this process is carried out $n$ times.

[Figure: the parabola between the points (0, 1) and (1, 0), with abscissa p. (Not to scale)]
It is easily shown that the co-ordinates of $A_n$ and $A'_n$ are

$$A_n:\ \left[\left(\frac{1 + (n-1)\sqrt{r}}{1 + n\sqrt{r}}\right)^2,\ \left(\frac{\sqrt{r}}{1 + n\sqrt{r}}\right)^2\right], \qquad A'_n:\ \left[\frac{1 + (n-1)\sqrt{r}}{1 + (n+1)\sqrt{r}},\ 0\right].$$

The proportion of hybrids may always be found from the relation

$$p + q + r = 1.$$

If the selection process is carried out $n$ times, then it is seen from
$A_n$ that the proportion of recessives in the $n$th filial generation will
be

$$r_n = \frac{r}{(1 + n\sqrt{r})^2}.$$


Example. If the proportion of recessives was 4/25 in the 
original population, how many times will the selection process 
need to be carried out in order that the proportion of recessives 
in the last generation should be less than 0.01?

Here $r = 4/25$, and it is required to determine $n$ such that

$$\frac{r}{(1 + n\sqrt{r})^2} < 0.01.$$

$n$ must at least equal 8, that is at least 8 selections must be made.
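
The required number of selections can be found by stepping n upward in the expression for the proportion of recessives after n selections. The sketch below assumes the closed form $r/(1 + n\sqrt{r})^2$ for a population that starts genetically stable; the function names are illustrative and not part of the original text.

```python
from math import sqrt

def recessives_after(n, r):
    """Proportion of recessives after n rounds of selection: r / (1 + n*sqrt(r))^2."""
    return r / (1 + n * sqrt(r)) ** 2

def selections_needed(r, target):
    """Smallest n for which the proportion of recessives falls below target."""
    n = 0
    while recessives_after(n, r) >= target:
        n += 1
    return n

r = 4 / 25
n = selections_needed(r, 0.01)
print(n, recessives_after(n - 1, r), recessives_after(n, r))
```

The fall is slow because, once recessives are rare, most of the recessive genes are carried by hybrids, which the selection does not touch.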



CORRELATION BETWEEN THE COMPOSITION 
OF SIBLINGS 

It will be assumed that there is a population which is genetically
stable and in which the mating is random. The proportions of
dominants, hybrids and recessives are $p$, $q$ and $r$ for both male and
female. By an application of the fundamental probability laws,
as at the beginning of this chapter, the probability of a pair of
offspring having given genetical compositions may be calculated,
and hence the correlation between the genetical compositions of
two offspring.

If the population is genetically stable, then, as before,

$$p = (1 - \sqrt{r})^2, \quad q = 2\sqrt{r}\,(1 - \sqrt{r}).$$

If $X_1$ and $X_2$ are the two parents and $Y_1$ and $Y_2$ the two offspring,
then

$$\begin{aligned}
P\{(Y_1 = AA)(Y_2 = AA)\} &= P\{(X_1 = AA)(X_2 = AA)(Y_1 = AA)(Y_2 = AA)\} \\
&+ P\{(X_1 = AA)(X_2 = Aa)(Y_1 = AA)(Y_2 = AA)\} \\
&+ P\{(X_1 = Aa)(X_2 = AA)(Y_1 = AA)(Y_2 = AA)\} \\
&+ P\{(X_1 = Aa)(X_2 = Aa)(Y_1 = AA)(Y_2 = AA)\}.
\end{aligned}$$

The other possible matings can be neglected because they could
not produce offspring the genetical composition of which was
AA. After expansion of the probabilities, for example,

$$P\{(X_1 = Aa)(X_2 = AA)(Y_1 = AA)(Y_2 = AA)\} = P\{X_1 = Aa\} \times P\{X_2 = AA\} \times P\{Y_1 = AA \mid (X_1 = Aa)(X_2 = AA)\} \times P\{Y_2 = AA \mid (X_1 = Aa)(X_2 = AA)\},$$

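it is found that

$$P\{(Y_1 = AA)(Y_2 = AA)\} = \tfrac14(1 - \sqrt{r})^2(2 - \sqrt{r})^2.$$

Similarly

$$P\{(Y_1 = AA)(Y_2 = Aa)\} = \tfrac12\sqrt{r}\,(1 - \sqrt{r})^2(2 - \sqrt{r}),$$
$$P\{(Y_1 = AA)(Y_2 = aa)\} = \tfrac14 r\,(1 - \sqrt{r})^2,$$
$$P\{(Y_1 = Aa)(Y_2 = Aa)\} = \sqrt{r}\,(1 - \sqrt{r})(1 + \sqrt{r} - r),$$
$$P\{(Y_1 = Aa)(Y_2 = aa)\} = \tfrac12 r\,(1 - \sqrt{r})(1 + \sqrt{r}),$$
$$P\{(Y_1 = aa)(Y_2 = aa)\} = \tfrac14 r\,(1 + \sqrt{r})^2.$$
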

Accordingly a correlation table (given on p. 91) may be drawn
up, each cell of which will be the joint probability of $Y_1$ and $Y_2$
having two given genetical compositions.

If AA, Aa and aa are arbitrarily assigned values 1, 0 and $-1$,
the table of probabilities can be treated as a correlation table and
the correlation coefficient between the genetical compositions of
$Y_1$ and $Y_2$ worked out. The total 'frequency' in the table is unity.

Hence each of $Y_1$ and $Y_2$ has mean $p - r = 1 - 2\sqrt{r}$ and variance
$2\sqrt{r}\,(1 - \sqrt{r})$, and easy algebra will give that $\rho$, the correlation
coefficient between the genetical composition of the offspring, is 0.5. The
correlation between the genetical composition of the offspring
will thus appear to be independent of the original proportions in
the population.
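
The value 0.5 can be confirmed by building the whole joint table from the mating probabilities and computing the correlation directly. The following is a sketch under the assumptions of the text (a stable population, random mating, scores 1, 0 and −1); the function names and the use of $s = \sqrt{r}$ as the argument are choices made here for illustration.

```python
from fractions import Fraction
from itertools import product

SCORE = {"AA": 1, "Aa": 0, "aa": -1}

def child_dist(x1, x2):
    """Mendelian probabilities of each offspring type for parents x1 and x2."""
    dist = {"AA": Fraction(0), "Aa": Fraction(0), "aa": Fraction(0)}
    for g1, g2 in product(x1, x2):
        dist["".join(sorted(g1 + g2))] += Fraction(1, 4)
    return dist

def sibling_correlation(s):
    """Correlation between two offspring when sqrt(r) = s in a stable population."""
    pop = {"AA": (1 - s) ** 2, "Aa": 2 * s * (1 - s), "aa": s ** 2}
    joint = {}
    for x1, x2 in product(pop, repeat=2):
        d = child_dist(x1, x2)
        for y1, y2 in product(d, repeat=2):
            # Given the parents, the two offspring are independent.
            joint[y1, y2] = joint.get((y1, y2), 0) + pop[x1] * pop[x2] * d[y1] * d[y2]
    mean = sum(pr * SCORE[y1] for (y1, _), pr in joint.items())
    var = sum(pr * SCORE[y1] ** 2 for (y1, _), pr in joint.items()) - mean ** 2
    cov = sum(pr * SCORE[y1] * SCORE[y2] for (y1, y2), pr in joint.items()) - mean ** 2
    return cov / var

for s in (Fraction(1, 5), Fraction(2, 5), Fraction(7, 10)):
    print(s, sibling_correlation(s))
```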

Exercise. Assume that there is a population which is genetically 
stable and in which the mating is random. Let the proportions 
of dominants, hybrids and recessives be p, q and r for both male 
and female. Show that the correlation between parent and
offspring is 0.5.

ELIMINATION OF RACES BY SELECTIVE BREEDING

In Panmixia we have discussed the problem of the elimination 
of recessives in a population by destroying the recessives of each 
successive generation. We shall now study another aspect of the 
problem whereby genes carrying undesirable characteristics are 
eliminated by purposively mating individuals with others having 
a different genetical composition. This type of race improvement 
is an everyday practical problem, particularly in cattle breeding, 
where, by careful choice of bulls to serve the herd, a farmer may 
convert a herd of parents with an indifferent milk yield into 
a herd of descendants with a good milk yield. Let

$$r_1 r_1,\ r_2 r_2,\ r_3 r_3,\ \ldots,\ r_n r_n$$

denote $n$ pairs of genes which it is desired to eliminate from a
race, $r_e$. No assumption is made whether these genes are dominant,
hybrid or recessive. Further, let

$$R_1 R_1,\ R_2 R_2,\ R_3 R_3,\ \ldots,\ R_n R_n$$

denote $n$ pairs of genes belonging to an individual of race $R_1$.
It is desired to introduce these $n$ pairs of genes into $r_e$. Individuals
of race $R_1$ and race $r_e$ are mated. The genetical composition of the first
generation must be

$$F_1 = R_1 \times r_e = R_1 r_1,\ R_2 r_2,\ \ldots,\ R_n r_n,$$

since the offspring will receive one gene of each type from each 
parent. Now mate the $F_1$ generation with individuals of the $R_1$
parent race. Individuals of the $F_2$ (second) generation will receive
one of a pair of genes from $R_1$ and one from $F_1$. The gene from $F_1$
may be $R$ or $r$. Thus

$$F_2 = F_1 \times R_1 = R_1 X_1,\ R_2 X_2,\ \ldots,\ R_n X_n,$$

where $X$ may be $R$ or $r$. Suppose that this backcrossing is carried
out $(s+1)$ times so that



Let $P_k^{s+1}$ denote the probability that an individual of the
$(s+1)$st generation will possess exactly $k$ genes of the $n$ genes of
type $r$ that it is desired to eliminate. Further, let $p_t(s+1)$ be the
probability that an individual of the $F_{s+1}$ generation will possess
a gene of type $r_t$. From the law of inheritance of genes it follows
that

$$p_t(s+1) = \tfrac{1}{2}\,p_t(s).$$

$p_t(1)$ is by definition the probability that an individual of the first
generation will possess a gene of type $r_t$. We have seen that it is
certain that this will be so and therefore

$$p_t(1) = 1, \qquad p_t(s+1) = \left(\tfrac{1}{2}\right)^{s}.$$

This result is independent of $t$ and will hold for any pair of genes.
The probability that an individual of the $F_{s+1}$ generation will
possess exactly $k$ genes of type $r$ will be, from the binomial
theorem,

$$P_k^{s+1} = \frac{n!}{k!\,(n-k)!}\left(\frac{1}{2^s}\right)^{k}\left(1-\frac{1}{2^s}\right)^{n-k}.$$



It is desired, if possible, to eliminate the genes of $r_e$ completely
and we are therefore concerned with the case of $k$ equal to zero,
that is, with the probability that an individual of the $F_{s+1}$
generation will possess no genes of the type to be eliminated.
This probability will be

$$P_0^{s+1} = \left(1-\frac{1}{2^s}\right)^{n}.$$

As $s$ increases without limit

$$P_0^{s+1} \to 1,$$

irrespective of the number of genes $n$.

Example. If $n = 12$, what is the smallest value of $s$ in order
that the most probable composition of an individual of $F_{s+1}$ will
be the composition of an individual of $R_1$?

It has been proved for the binomial that if the greatest term
of the expansion $(q+p)^n$ is at the integer $k_0$, then

$$np - q \le k_0 \le np + p.$$

In this present example it is required that $k_0$ should be zero and
hence, with $p = (\tfrac{1}{2})^s$, it will be necessary that

$$np + p < 1,$$

i.e. that

$$13\left(\tfrac{1}{2}\right)^{s} < 1,$$

from which it follows that $s$ must equal 4.
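
The answer may be confirmed by direct search. The sketch below assumes that the chance of retaining any one unwanted gene after $s$ backcrosses is $(\tfrac{1}{2})^s$, evaluates the binomial terms, and looks for the first $s$ at which the largest term falls at $k = 0$. The function names are illustrative, not part of the text.

```python
from math import comb


def most_probable_k(n, s):
    """Index of the greatest binomial term when p = (1/2)**s."""
    p = 0.5 ** s
    terms = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    return max(range(n + 1), key=terms.__getitem__)


def smallest_s_with_mode_zero(n):
    """Smallest s for which an individual with no unwanted genes is most probable."""
    s = 1
    while most_probable_k(n, s) != 0:
        s += 1
    return s
```

For $n = 12$ the search stops at the same value of $s$ as the inequality $13(\tfrac{1}{2})^s < 1$.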
Example. If $n = 6$, show how the distribution of $P_k^{s+1}$ changes
as $s$ increases. From theory

$$P_k^{s+1} = \frac{6!}{k!\,(6-k)!}\left(\frac{1}{2^s}\right)^{k}\left(1-\frac{1}{2^s}\right)^{6-k}$$

and the problem reduces to that of calculating a number of
binomial probabilities.

Table of $P_k^{s+1}$ for $n = 6$

 s \ k      0         1         2         3         4         5         6

   1    0.015625  0.093750  0.234375  0.312500  0.234375  0.093750  0.015625
   2    0.177978  0.355957  0.296631  0.131836  0.032959  0.004395  0.000244
   3    0.448795  0.384681  0.137386  0.026169  0.002804  0.000160  0.000004
   4    0.678934  0.271574  0.045262  0.004023  0.000201  0.000005
   5    0.826553  0.159978  0.012901  0.000555  0.000013
   6    0.909830  0.086651  0.003439  0.000073  0.000001









Example. What is the smallest number of backcrossings
necessary in order that the probability that an individual of the
$(s+1)$st generation possesses no genes of type $r$ shall be at least
equal to 0.99? Assume $n = 12$.

It is required that

$$\left(1-\frac{1}{2^s}\right)^{12} \ge 0.99,$$

i.e. that

$$\frac{1}{2^s} \le 1 - (0.99)^{1/12}.$$

In order to satisfy this inequality $s$ must be at least equal to 11.
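
A direct search gives the same answer. The sketch assumes, as above, that the probability of an individual carrying none of the $n$ unwanted genes is $(1 - 2^{-s})^n$; the function name is illustrative.

```python
def smallest_s(n, target):
    """Smallest s for which (1 - 2**-s)**n is at least `target`."""
    s = 1
    while (1 - 0.5 ** s) ** n < target:
        s += 1
    return s
```

With $n = 12$ and a target of 0.99 the loop passes $s = 10$, where the probability is still a little under 0.99, and stops at $s = 11$.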



BAYES' THEOREM AND MENDELIAN HYPOTHESES 

In a previous chapter the conditions under which the applica- 
tion of Bayes' theorem is thought to be legitimate have been set 
out, and it was stated that these conditions were nearly always 
fulfilled for Mendelian hypotheses. It is proposed now to illustrate 
by means of examples the application of the theorem in this case. 

Example. From the mating of two dominant-looking hybrids, 
Aa x Aa, a dominant-looking offspring is obtained of composition 
Ax, x being unknown. This individual is mated with another 
hybrid and as a result of this mating n individuals are obtained, all 
of which look like dominants. What is the a posteriori probability, 
that x = A? 
From the mating of the parent hybrids we may obtain an individual
the genetical composition of which is aa, Aa, or AA. The
first alternative is ruled out because we are told that the individual
is dominant-looking. Let $h_1$ be the hypothesis that $x = A$ and $h_2$
the hypothesis that $x = a$. It is required to find the probability
that the hypothesis $h_1$ is true. Consider now the mating of $Ax$
with another hybrid. If $x = A$ we have, for a single offspring, $Y$,

$$P\{Y = AA \text{ or } Aa\} = P\{Y = AA \text{ or } Aa \mid X_1 = AA,\ X_2 = Aa\} = 1.$$

If $x = a$ then

$$P\{Y = AA \text{ or } Aa\} = P\{Y = AA \text{ or } Aa \mid X_1 = Aa,\ X_2 = Aa\} = \tfrac{3}{4}.$$

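Hence the probability of obtaining $n$ dominant-looking offspring
under hypotheses $h_1$ and $h_2$ will be

$$P\{n(AA \text{ or } Aa) \mid h_1\} = 1,$$

$$P\{n(AA \text{ or } Aa) \mid h_2\} = \left(\tfrac{3}{4}\right)^{n}.$$

The a priori probabilities of the hypotheses $h_1$ and $h_2$ will be

$$P\{h_1\} = \tfrac{1}{3}, \qquad P\{h_2\} = \tfrac{2}{3},$$

for the possible offspring from the mating of two hybrids are
AA, Aa, aA and aa and the last alternative is ruled out because
the individual is dominant-looking. All the probabilities necessary
for the calculation of probabilities by Bayes' theorem have been
enumerated. Accordingly

$$P\{h_1 \mid n(AA \text{ or } Aa)\} = \frac{\tfrac{1}{3}}{\tfrac{1}{3} + \tfrac{2}{3}\left(\tfrac{3}{4}\right)^{n}} = \frac{1}{1 + 2\left(\tfrac{3}{4}\right)^{n}}$$

and

$$P\{h_2 \mid n(AA \text{ or } Aa)\} = \frac{2\left(\tfrac{3}{4}\right)^{n}}{1 + 2\left(\tfrac{3}{4}\right)^{n}}.$$

Let $n = 4$; then

$$P\{h_1 \mid 4(AA \text{ or } Aa)\} = 0.61,$$

$$P\{h_2 \mid 4(AA \text{ or } Aa)\} = 0.39,$$

and we should not be certain of either hypothesis unless more
offspring from a further mating of $Ax$ with a hybrid were obtained.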
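
The arithmetic of this example is easily mechanised. The sketch below is an illustration only: it takes the a priori probabilities as 1/3 for $x = A$ and 2/3 for $x = a$, and the chances of $n$ dominant-looking offspring as 1 and $(3/4)^n$ respectively; the function name is an assumption.

```python
from fractions import Fraction


def posterior_x_is_A(n):
    """A posteriori probability that x = A after n dominant-looking offspring."""
    prior = {"A": Fraction(1, 3), "a": Fraction(2, 3)}
    likelihood = {"A": Fraction(1), "a": Fraction(3, 4) ** n}
    joint = {h: prior[h] * likelihood[h] for h in prior}
    return joint["A"] / sum(joint.values())
```

For $n = 4$ the function returns 128/209, which is 0.61 to two decimal places. Increasing $n$ shows how slowly the probability of $x = A$ approaches unity.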
Example. A certain pair of genes has the property that pure
recessive individuals with the composition rr possess a certain
defect X, while the dominants RR and the hybrids Rr are
normal. Consider the following pedigree in which one individual,
$C_1$, possesses the inherited defect, X, all others
appearing normal.

[Pedigree diagram: $A_1$ and $A_2$ are the parents of $B_2$ and $B_3$; $B_1$ and $B_2$ are the parents of $C_1$ (marked X) and $C_2$; $B_3$ and $B_4$ are the parents of $C_3$.]

$C_2$ and $C_3$ intend to marry and it is required to calculate the
probability that a single one of their offspring would possess the
defect X. Assume that $A_2$ and $B_4$ were selected at random from
a population of apparently normal individuals and that the
probability that any one of these has a hidden gene r is
$p(r) = 0.001$. (B.Sc. London, 1937.)

It is stated that $C_1$ is rr and that both $B_1$ and $B_2$ are normal
individuals. It follows that both $B_1$ and $B_2$ must have the
composition Rr. Further, since $A_1$ and $A_2$ are normal, then either
$A_1$ or $A_2$ must have the composition Rr. Let $A_1$ be the parent
who passed the r gene to $B_2$.

The offspring from the mating of $B_1$ and $B_2$ can have the
compositions RR, Rr or rr. $C_2$ is normal and cannot be rr and
we have therefore

$$P\{C_2 = RR\} = \tfrac{1}{3}, \qquad P\{C_2 = Rr\} = \tfrac{2}{3}.$$
This completes the left-hand branch of the pedigree. 
Let us now consider the right-hand branch. 

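$$
\begin{aligned}
P\{B_3 = RR\} &= P\{(A_1 = Rr)(A_2 = RR)(B_3 = RR)\} \\
  &\quad + P\{(A_1 = Rr)(A_2 = Rr)(B_3 = RR)\} \\
  &= P\{A_1 = Rr\}\,P\{A_2 = RR\}\,P\{(B_3 = RR) \mid (A_1 = Rr)(A_2 = RR)\} \\
  &\quad + P\{A_1 = Rr\}\,P\{A_2 = Rr\}\,P\{(B_3 = RR) \mid (A_1 = Rr)(A_2 = Rr)\}.
\end{aligned}
$$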




It is known that $A_1$ must be Rr and therefore

$$P\{A_1 = Rr\} = 1,$$

but what is the probability that $A_2$ is a dominant or a hybrid?
We are told that the probability of an individual possessing a
hidden gene of type r is 0.001. It follows therefore that if an
individual such as $A_2$ or $B_4$ is chosen at random from the
population,

$$P\{A_2 = Rr\} = 0.001 \quad \text{and} \quad P\{A_2 = RR\} = 0.999.$$

Hence

$$P\{B_3 = RR\} = 0.999 \times \tfrac{1}{2} + 0.001 \times \tfrac{1}{4} = 0.49975.$$

Similarly

$$P\{B_3 = Rr\} = 0.50000$$

and therefore

$$P\{B_3 = rr\} = 0.00025.$$

$B_3$ is, however, reported as normal and accordingly the
a posteriori probabilities will be

$$P\{B_3 = RR\} = 0.499875, \qquad P\{B_3 = Rr\} = 0.500125.$$

The a priori probabilities that $C_3$ is a dominant or a hybrid may
be calculated in the same way as for $B_3$.

$$
\begin{aligned}
P\{C_3 = RR\} &= P\{(B_3 = RR)(B_4 = RR)(C_3 = RR)\} \\
  &\quad + P\{(B_3 = RR)(B_4 = Rr)(C_3 = RR)\} \\
  &\quad + P\{(B_3 = Rr)(B_4 = RR)(C_3 = RR)\} \\
  &\quad + P\{(B_3 = Rr)(B_4 = Rr)(C_3 = RR)\}.
\end{aligned}
$$

Expanding and substituting numerical values we have

$$P\{C_3 = RR\} = 0.74956$$

and by a similar process

$$P\{C_3 = Rr\} = 0.25031, \qquad P\{C_3 = rr\} = 0.00013,$$

whence the probabilities a posteriori for $C_3$ can be deduced to be

$$P\{C_3 = RR\} = 0.74966, \qquad P\{C_3 = Rr\} = 0.25034.$$

Let $Y$ be the offspring if $C_2$ and $C_3$ marry. It is required to find
the probability that $Y$ will possess the defect, i.e. that $Y = rr$.

$$P\{Y = rr\} = P\{(C_2 = Rr)(C_3 = Rr)(Y = rr)\} = 0.042.$$
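
The chain of a priori and a posteriori probabilities in this pedigree may be followed step by step in a short program. The sketch below is an illustration under the assumptions of the example ($A_1$, $B_1$ and $B_2$ certainly Rr; $A_2$ and $B_4$ hybrids with probability 0.001); the helper names are hypothetical.

```python
from fractions import Fraction

RR, Rr, rr = "RR", "Rr", "rr"
PASS_r = {RR: Fraction(0), Rr: Fraction(1, 2), rr: Fraction(1)}  # chance of passing on r


def offspring_law(x1, x2):
    """Law of the composition of one offspring of parents x1 and x2."""
    a, b = PASS_r[x1], PASS_r[x2]
    return {RR: (1 - a) * (1 - b), Rr: a * (1 - b) + (1 - a) * b, rr: a * b}


def normal_child_law(law1, law2):
    """A posteriori law of a child reported normal; parents have independent laws."""
    prior = {RR: Fraction(0), Rr: Fraction(0), rr: Fraction(0)}
    for x1, p1 in law1.items():
        for x2, p2 in law2.items():
            for y, py in offspring_law(x1, x2).items():
                prior[y] += p1 * p2 * py
    normal = prior[RR] + prior[Rr]  # the child is known not to be rr
    return {RR: prior[RR] / normal, Rr: prior[Rr] / normal}


hidden = Fraction(1, 1000)
random_normal = {RR: 1 - hidden, Rr: hidden}  # law for A2 and for B4
certain_hybrid = {Rr: Fraction(1)}            # law for A1, B1 and B2

C2 = normal_child_law(certain_hybrid, certain_hybrid)
B3 = normal_child_law(certain_hybrid, random_normal)
C3 = normal_child_law(B3, random_normal)
p_defect = C2[Rr] * C3[Rr] * Fraction(1, 4)   # both parents hybrid, child rr
```

Carried out in exact fractions, the calculation agrees with the rounded figures quoted above.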
Exercise. Consider the genes R and r as given in the preceding
example. Let $F_1$ and $F_2$ be two consecutive generations of a
population. Denote by $q_1$ the probability that an individual,
chosen at random from the apparently normal individuals of $F_1$,
will be a hybrid (Rr). Assume that the matings of apparently
normal individuals are at random.

(i) What is the probability, $q_2$, that an apparently normal
individual, $Y$, of $F_2$ whose parents are externally normal, will be
a hybrid?

(ii) What is the probability, $P\{(Z_1 = Rr)(Z_2 = Rr) \mid n\}$, that
the apparently normal individuals $Z_1$ and $Z_2$ will have the
compositions $Z_1 = Rr$ and $Z_2 = Rr$ when it is known that their
offspring, $n$ in number, are all externally normal?

(iii) What is the probability that an apparently normal individual,
$W$, of $F_2$ will have the composition Rr, given that his
parents and $(n-1)$ brothers are known to be externally normal?

(iv) If the probability in (iii) is $q_2(n)$, find the limit of $q_2(n)$ as
$n$ tends to infinity. Put $q_1 = 0.001$ and $n = 1, 2, 3$, in turn and
see how the knowledge that the parents and siblings of $W$ are
externally normal influences the probability that $W$ has a hidden
gene, r. (B.Sc. London, 1937.)

REFERENCES AND READING 

Any elementary text-book on genetics will give the reader more
genetical terminology than has been assumed as known here. Applications
of probability to genetical problems are spread widely through
genetical literature. We may mention two books by K. Mather, Statistical
Analysis in Biology and The Measurement of Linkage in Heredity, in 
which the student will find a number of biological problems treated 
statistically. Chapter ix of R. A. Fisher, Statistical Methods for Research 
Workers, may also be read with profit. 

The main ideas of the present chapter were obtained from lectures by 
Karl Pearson and J. Neyman. The interpretation of these ideas is the 
writer's own. 



CHAPTER IX 

MULTINOMIAL THEOREM AND SIMPLE 
COMBINATORIAL ANALYSIS 

Thus far in probability we have been concerned chiefly with 
fundamental probability sets the elements of which possess two 
alternative characteristics only; an event may happen or not 
happen, a ball may be black or white, and so on. No discussion 
of discrete probabilities would, however, be complete without 
some investigation of the case where an individual of the funda- 
mental probability set may possess one of several different 
characteristics. The binomial theorem gives a method for the 
calculation of probabilities when there are two alternatives; we 
now turn to the multinomial theorem which applies to cases in 
which more than two alternatives need to be considered. 

In stating that an element of the fundamental probability set
may possess any one of $k$ mutually exclusive properties, these
properties being the only possible, we are formulating a general
proposition a particular case of which might be that an event
may happen in $k$ different ways and so on. If the fundamental
probability set is composed of $N$ elements, $N_1$ of which possess
the property $A_1$, $N_2$ of which possess the property $A_2$, ..., $N_k$ of
which possess the property $A_k$, where the $N$ elements may be
actual recorded happenings or a mathematical model, then $p_i$,
the probability that an element of the fundamental probability
set possesses the property $A_i$, will be

$$p_i = \frac{N_i}{N} \qquad (i = 1, 2, \ldots, k)$$

by definition.

Suppose now that $n$ independent trials are made. The probability
that a single trial will result in an element being found to
possess a given characteristic is defined, and we proceed, as in
the case of the binomial, to ask, what is the probability that as
a result of these $n$ trials $r_1$ elements will be found to possess the
character $A_1$, $r_2$ the character $A_2$, ..., $r_k$ the character $A_k$?

MULTINOMIAL THEOREM. An event may happen in $k$ mutually
exclusive ways which are also the only possible. The probability
that in $n$ trials the event will happen $r_1$ times in the first way,
$r_2$ in the second, ..., $r_k$ times in the $k$th way is

$$\frac{n!}{r_1!\,r_2!\ldots r_k!}\,p_1^{r_1}p_2^{r_2}\ldots p_k^{r_k},$$

where $p_i$ is the probability that the event will happen in the $i$th
way in a single trial and $\sum_{i=1}^{k} p_i = 1$.

Since the $k$ ways are mutually exclusive and are also the only
possible, the event must happen in one of the given ways. If it
were required to find the probability that it would happen in the
first way for the first $r_1$ trials, in the second way for the next $r_2$
trials, and so on, the probability would be simply

$$p_1^{r_1}p_2^{r_2}\ldots p_k^{r_k}.$$

No order, however, is specified and it is necessary therefore to
enumerate the number of ways in which $r_1, r_2, \ldots, r_k$ trials can be
arranged subject to the restriction that

$$r_1 + r_2 + \ldots + r_k = n.$$

The expression $(p_1 + p_2 + \ldots + p_k)^n$ is the product of

$$(p_1 + p_2 + \ldots + p_k)$$

by itself $n - 1$ times, i.e.

$$(p_1 + p_2 + \ldots + p_k)^n = (p_1 + p_2 + \ldots + p_k)(p_1 + p_2 + \ldots + p_k)\ldots(p_1 + p_2 + \ldots + p_k).$$

Every term in the expansion of the left-hand side is formed by
taking one symbol out of each of the $n$ brackets of the right-hand
side. Hence the number of ways in which any term $p_1^{r_1}p_2^{r_2}\ldots p_k^{r_k}$
will appear in the final expansion will be the number of ways of
arranging $n$ symbols when $r_1$ must be $p_1$, $r_2$ must be $p_2$, ..., $r_k$
must be $p_k$.

This is the same requirement as for the arrangement of
probabilities. It follows that the probability of obtaining $r_1$
trials of the first kind, $r_2$ of the second and so on will be given by
the complete term in $p_1^{r_1}p_2^{r_2}\ldots p_k^{r_k}$ in the expansion of

$$(p_1 + p_2 + \ldots + p_k)^n$$

and that this probability is

$$\frac{n!}{r_1!\,r_2!\ldots r_k!}\,p_1^{r_1}p_2^{r_2}\ldots p_k^{r_k}.$$

The expression $(p_1 + p_2 + \ldots + p_k)^n$ may be spoken of as the
generating function of the probabilities.

When only two alternatives are possible, such as when an 
event may or may not happen, then 



and the probability that an event will happen exactly $k$ times
in $n$ trials is

$$\frac{n!}{k!\,(n-k)!}\,p^k q^{n-k},$$

as found in Chapter III.

Example. A bag contains 5 white, 7 green, 12 red and 14 black
balls. The balls are indistinguishable from each other except by
colour. A ball is drawn and replaced, after its colour has been
noted, on ten occasions. If any ball is as likely to be drawn as any
other, what is the probability that of the ten balls seen 3 will be
white, 3 green, 2 red and 2 black?

Answer:

$$\frac{10!}{3!\,3!\,2!\,2!}\left(\frac{5}{38}\right)^{3}\left(\frac{7}{38}\right)^{3}\left(\frac{12}{38}\right)^{2}\left(\frac{14}{38}\right)^{2}.$$
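
The multinomial term for this example can be evaluated exactly. The sketch below is illustrative only; it assumes drawing with replacement from the 38 balls, so that the four colours have probabilities 5/38, 7/38, 12/38 and 14/38, and the function name is an assumption.

```python
from fractions import Fraction
from math import factorial


def multinomial_probability(counts, probs):
    """n!/(r_1! ... r_k!) * p_1**r_1 * ... * p_k**r_k, with n = sum(counts)."""
    coeff = factorial(sum(counts))
    for r in counts:
        coeff //= factorial(r)  # each partial quotient is an integer
    prob = Fraction(coeff)
    for r, p in zip(counts, probs):
        prob *= Fraction(p) ** r
    return prob


balls = [5, 7, 12, 14]                       # white, green, red, black
probs = [Fraction(b, sum(balls)) for b in balls]
answer = multinomial_probability([3, 3, 2, 2], probs)
```

The coefficient is 25,200 and the probability is a little under one half of one per cent.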

The binomial and multinomial theorems are, if equal prob- 
ability of all elements in the fundamental probability set is 
assumed or established, simple propositions which fit into a 
general mathematical scheme of arrangements generally known 
as combinatorial analysis. The ideas and theorems used in 
combinatorial analysis, as far as probability is concerned, are
not new (many of them were known to Laplace), but they do
not appear to be as well known as they should. We may discuss
here certain simple aspects of this analysis both in relation to the 
theory of probability and in its application to statistical method, 
but it will be necessary first of all to define certain quantities and 
to state some of their properties. 
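DIFFERENCES OF ZERO

If $x_1, x_2, \ldots, x_n$ are a series of numbers, or the values of a given
function at successive entries of the tabled argument, then it is
conventional to write

$$
\begin{array}{llll}
x_1 &            &              &              \\
    & \Delta x_1 &              &              \\
x_2 &            & \Delta^2 x_1 &              \\
    & \Delta x_2 &              & \Delta^3 x_1 \\
x_3 &            & \Delta^2 x_2 &              \\
    & \Delta x_3 &              & \Delta^3 x_2 \\
x_4 &            & \Delta^2 x_3 &              \\
    & \Delta x_4 &              &              \\
x_5 &            &              &
\end{array}
$$

where $\Delta x_1 = x_2 - x_1$, $\Delta^2 x_1 = \Delta x_2 - \Delta x_1$, $\Delta^3 x_1 = \Delta^2 x_2 - \Delta^2 x_1$, and so
on. The differences associated with $x_1$ are often spoken of as the
leading differences.

If $x_1 = x^s$, $x_2 = (x+1)^s$, ..., $x_n = (x+n-1)^s$, then we have

$$
\begin{array}{llll}
x^s     &                 &                   &                \\
        & \Delta x^s      &                   &                \\
(x+1)^s &                 & \Delta^2 x^s      &                \\
        & \Delta (x+1)^s  &                   & \Delta^3 x^s   \\
(x+2)^s &                 & \Delta^2 (x+1)^s  &                \\
        & \Delta (x+2)^s  &                   &                \\
(x+3)^s &                 &                   &
\end{array}
$$

and further, if $x$ is put equal to zero,

$$
\begin{array}{lllll}
0^s &               &                 &                 &                 \\
    & \Delta(0)^s   &                 &                 &                 \\
1^s &               & \Delta^2(0)^s   &                 &                 \\
    & \Delta(1)^s   &                 & \Delta^3(0)^s   &                 \\
2^s &               & \Delta^2(1)^s   &                 & \Delta^4(0)^s   \\
    & \Delta(2)^s   &                 & \Delta^3(1)^s   &                 \\
3^s &               & \Delta^2(2)^s   &                 & \Delta^4(1)^s   \\
    & \Delta(3)^s   &                 &                 &
\end{array}
$$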
The leading differences $\Delta(0)^s$, $\Delta^2(0)^s$, ..., $\Delta^r(0)^s$ are named difference
quotients of zero, or more often simply differences of zero.
It is easily proved that

$$\Delta^r(0)^s = r! \quad \text{if } r = s,$$

and

$$\Delta^r(0)^s = 0 \quad \text{if } r > s.$$

It is possible to show (see for example Milne-Thompson, p. 36)
that the following recurrence relationship holds:

$$\Delta^r(0)^s = r\left[\Delta^{r-1}(0)^{s-1} + \Delta^{r}(0)^{s-1}\right]$$

or, if each side is divided by $r!$, then

$$\frac{\Delta^r(0)^s}{r!} = r\,\frac{\Delta^r(0)^{s-1}}{r!} + \frac{\Delta^{r-1}(0)^{s-1}}{(r-1)!}.$$
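
The recurrence lends itself to mechanical tabulation. The following sketch builds $\Delta^r(0)^s/r!$ from the recurrence and compares the result with the $r$th leading difference of the sequence $0^s, 1^s, 2^s, \ldots$ computed directly as an alternating sum. The starting values (the quotient is 1 for $r = s = 0$ and 0 when $r > s$) and the function names are assumptions made for the illustration.

```python
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def quotient(r, s):
    """Delta^r(0)^s / r!, built up from the recurrence relationship."""
    if r == 0:
        return 1 if s == 0 else 0
    if s == 0 or r > s:
        return 0
    return r * quotient(r, s - 1) + quotient(r - 1, s - 1)


def difference_of_zero(r, s):
    """Delta^r(0)^s obtained from the tabulated quotient."""
    return factorial(r) * quotient(r, s)


def difference_of_zero_direct(r, s):
    """r-th leading difference of 0^s, 1^s, 2^s, ... as an alternating sum."""
    return sum((-1) ** (r - j) * comb(r, j) * j ** s for j in range(r + 1))
```

For instance `quotient(3, 5)` is 25, so that $\Delta^3(0)^5 = 150$.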

It is curious, since these differences have been used by probabilists
for over a century, that no table of them appeared before 1925,
when Karl Pearson and E. M. Elderton tabled the expression
$\Delta^r(0)^{r+s}/(r+s)!$ for values of $r$ and $s$ ranging by integer values
from 1 to 20. W. L. Stevens, who was not aware of this earlier
table, calculated $\Delta^r(0)^s/r!$ for $r$ and $s$ ranging by integer values
from 1 to 25, in 1937. A straightforward relationship exists
between the $r$th differences of zero and the Bernoulli numbers of
order $r$, namely



Generally B? = - r - S_ ( - ( - 

the first ten Bernoulli numbers obtained from this relationship 
being, after 



$$B_6^{(r)} = \frac{r}{4032}\,(63r^5 - 315r^4 + 315r^3 + 91r^2 - 42r - 16),$$

$$B_7^{(r)} = -\frac{r^2}{1152}\,(9r^5 - 63r^4 + 105r^3 + 7r^2 - 42r - 16),$$

$$B_8^{(r)} = \frac{r}{34560}\,(135r^7 - 1260r^6 + 3150r^5 - 840r^4 - 2345r^3 - 540r^2 + 404r + 144),$$

$$B_9^{(r)} = -\frac{r^2}{7680}\,(15r^7 - 180r^6 + 630r^5 - 448r^4 - 665r^3 + 100r^2 + 404r + 144),$$

$$B_{10}^{(r)} = \frac{r}{101{,}376}\,(99r^9 - 1485r^8 + 6930r^7 - 8778r^6 - 8085r^5 + 8195r^4 + 11{,}792r^3 + 2068r^2 - 2288r - 768).$$

When $r = 1$ these numbers reduce to the quantities usually
designated Bernoulli numbers. It is possible, by the use of
Bernoulli numbers of order $r$, to solve any probability problem
where $s - r$ is small without recourse to tables.



A SIMPLE SYSTEM OF ARRANGEMENTS 

Assume that there is a box $B$ which is divided into $N$ equal and
identical compartments. $k$ identical balls, where $k \le N$, are
dropped into the box at random and no restriction is placed on
the number of balls which may fall into any one compartment.
The problem will consist in enumerating the number of different
patterns into which the $k$ balls may arrange themselves among
the $N$ compartments.

If it is equally likely that any one ball will fall in any one
compartment, then the first ball will have a choice of $N$ equally
likely alternatives; so will the second, and the third, and so on,
so that the total number of patterns in which the $k$ balls may
arrange themselves is $N^k$. This total number of patterns will be
the sum of the sets of different patterns in which the $k$ balls may
fall. One set of patterns will be when the $k$ balls fall into $k$ different
compartments and $k$ only. They cannot fall into more than $k$
compartments because there are only $k$ balls. Another set of
patterns will be when $k$ balls fill exactly $k-1$ compartments,
which will imply that $k-2$ compartments contain one ball each
and one compartment has 2 balls, and so on. The different patterns
may be enumerated in the following way:

$k$ balls in $k$ compartments in $c_k \cdot N(N-1)\ldots(N-k+1)$ ways,
$k$ balls in $k-1$ compartments in $c_{k-1} \cdot N(N-1)\ldots(N-k+2)$ ways,
...
$k$ balls in $k-t$ compartments in $c_{k-t} \cdot N(N-1)\ldots(N-k+t+1)$ ways,
...
$k$ balls in 1 compartment in $c_1 \cdot N$ ways.

$c_1, c_2, \ldots, c_k$ are constants for a given $k$ and depend only on $k$.
That is to say, once the number of compartments to be filled is
fixed (say $k-t$), the distribution of the $k-t$ compartments
among the $N$ has been given and it is only left to enumerate the
possible distribution of the $k$ balls in the $k-t$ compartments. All
sets of patterns have been enumerated above and hence

$$N^k = c_k\,\frac{N!}{(N-k)!} + c_{k-1}\,\frac{N!}{(N-k+1)!} + \ldots + c_2\,\frac{N!}{(N-2)!} + c_1\,\frac{N!}{(N-1)!}.$$

It remains to evaluate the $c$'s. This follows immediately by
differencing* both sides of the equation $t$ times for $t = 1, 2, \ldots, k$
and then putting $N$ equal to zero, whence

$$c_t = \frac{\Delta^t(0)^k}{t!}$$

and we have

$$N^k = \frac{N!}{(N-1)!}\,\frac{\Delta(0)^k}{1!} + \frac{N!}{(N-2)!}\,\frac{\Delta^2(0)^k}{2!} + \ldots + \frac{N!}{(N-k)!}\,\frac{\Delta^k(0)^k}{k!}.$$

* This is in effect an application of Gregory's theorem, viz.

$$f(n) = f(0) + \frac{n!}{1!\,(n-1)!}\,\Delta f(0) + \frac{n!}{2!\,(n-2)!}\,\Delta^2 f(0) + \ldots$$

The enumeration is, however, not yet complete. We have shown
that the total number of ways in which $k$ balls may arrange
themselves in $t$ compartments is $c_t$, as given above, but it is of
interest to enumerate the different distributions of $k$ within these
$t$ compartments. Thus we may wish to know in how many
compartments there will be just one ball, and so on. This process
is sometimes called enumerating the different partitions of $k$ by
$t$. An easy extension of the multinomial theorem, which gives the
number of ways in which $a_1$ compartments contain just one ball,
$a_2$ compartments just two balls, ..., $a_j$ compartments just $j$
balls, is

$$\frac{k!}{(1!)^{a_1}(2!)^{a_2}\ldots(j!)^{a_j}\;a_1!\,a_2!\ldots a_j!},$$

where the restrictions are that

$$\sum_{i=1}^{j} a_i = t \quad \text{and} \quad \sum_{i=1}^{j} i\,a_i = k$$

and some, but not all, of the $a$'s may be equal to zero. The sum of
this expression, taken over all possible partitions of $k$ by $t$, will be
equal to $c_t$.

Example. Five balls are dropped at random in 10 compartments.
Given all conditions are equally probable, in how many
ways can they arrange themselves in order to occupy 3 compartments
and 3 compartments only?

Here $N = 10$, $k = 5$, $t = 3$.

The number of ways will therefore be

$$\frac{\Delta^3(0)^5}{3!} \cdot 10 \cdot 9 \cdot 8.$$

It may be shown that

$$\frac{\Delta^3(0)^5}{3!} = 25.$$

The total number of ways will be therefore 18,000.

To enumerate these ways in detail we must discuss the
different partitions of 5 by 3. These will be

3 1 1

2 2 1

and we therefore have

$$\frac{5!}{3!\,(1!)^2\;1!\,2!} = 10, \qquad \frac{5!}{(2!)^2\,1!\;2!\,1!} = 15,$$

which we note add together to make 25. Hence the number of
ways in which 5 balls may occupy 3 out of 10 compartments, with
3 balls in one and 1 in each of the two others, is 7200 ways, and
with 1 ball in one and 2 in each of the two others is 10,800.

Finally, since the balls are dropped at random and each ball
is equally likely to drop in any given compartment we may
calculate the probability for each partition. The total number
of ways in which the 5 balls may distribute themselves in the
10 compartments is $10^5$. Hence the chance of the partition 3, 1, 1 is

$$\frac{7200}{10^5} = 0.072,$$

that of the partition 2, 2, 1 is

$$\frac{10{,}800}{10^5} = 0.108,$$

and the total chance that if the 5 balls are dropped at random
they will occupy 3 compartments only is 0.180.

Example. What is the most likely distribution of balls if 5 are
dropped at random in 10 compartments?

5 balls in 1 compartment  in      10 ways
5 balls in 2 compartments in   1,350 ways
5 balls in 3 compartments in  18,000 ways
5 balls in 4 compartments in  50,400 ways
5 balls in 5 compartments in  30,240 ways

                       Total 100,000

Hence the most likely distribution is 5 balls in 4 compartments.
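
The whole of this enumeration may be reproduced by a few lines of code. The sketch below is an illustration, not the author's method of computation: it obtains $\Delta^t(0)^k$ as an alternating sum, multiplies $\Delta^t(0)^k/t!$ by $N(N-1)\ldots(N-t+1)$, and counts the ways belonging to a given partition. The balls are treated as distinguishable by the order in which they are dropped, as the total $N^k$ requires; the function names are assumptions.

```python
from collections import Counter
from math import comb, factorial, perm


def ways(N, k, t):
    """Number of ways k balls occupy exactly t of N compartments."""
    diff = sum((-1) ** (t - j) * comb(t, j) * j ** k for j in range(t + 1))
    return diff // factorial(t) * perm(N, t)  # perm(N, t) = N(N-1)...(N-t+1)


def partition_ways(parts):
    """Ways of splitting sum(parts) balls into unlabelled groups of the given sizes."""
    count = factorial(sum(parts))
    for size in parts:
        count //= factorial(size)
    for multiplicity in Counter(parts).values():
        count //= factorial(multiplicity)
    return count


if __name__ == "__main__":
    for t in range(1, 6):
        print(t, ways(10, 5, t))
```

Multiplying `partition_ways([3, 1, 1])` and `partition_ways([2, 2, 1])` by $10 \cdot 9 \cdot 8$ gives the 7200 and 10,800 ways of the preceding example.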

The problem considered by Laplace was a little more compli- 
cated than the foregoing. He considered a list of n different 
numbers all of which had the same probability of being drawn. 
r of these numbers were randomly chosen. They were noted and 
returned to the population of n. He then discussed the probability 
that after i sets of drawings of r, q or more different numbers 
would have been seen. 

The number of distributions possible in a single set of $r$ drawings is

$$\frac{n!}{r!\,(n-r)!},$$

since all alternatives are given as equally probable. The number
of distributions possible in $i$ sets will be

$$\left(\frac{n!}{r!\,(n-r)!}\right)^{i}.$$
The number of cases in which the number 1 will not be drawn will
be given by excluding that number from the list of $n$. This will be

$$\left(\frac{(n-1)!}{r!\,(n-r-1)!}\right)^{i}$$

and therefore the total number of ways in which the number 1
can be drawn is

$$\left(\frac{n!}{r!\,(n-r)!}\right)^{i} - \left(\frac{(n-1)!}{r!\,(n-r-1)!}\right)^{i} = \Delta\left(\frac{(n-1)!}{r!\,(n-r-1)!}\right)^{i}.$$

Following the same argument it can be shown that the number of
ways in which 1 and 2 may be drawn is

$$\Delta^2\left(\frac{(n-2)!}{r!\,(n-r-2)!}\right)^{i},$$

from which, by further extension of the argument, the number
of ways in which $q$ different numbers may be drawn is

$$\Delta^q\left(\frac{(n-q)!}{r!\,(n-r-q)!}\right)^{i} = \frac{1}{(r!)^{i}}\,\Delta^q\bigl((n-q)(n-q-1)\ldots(n-q-r+1)\bigr)^{i} = C(n, r, i, q) \text{ (say)}.$$

The probability that $q$ different numbers will be seen in $i$ sets of
drawings will therefore be

$$P(n, r, i, q) = C(n, r, i, q) \Big/ \left(\frac{n!}{r!\,(n-r)!}\right)^{i}.$$

COROLLARY. If the probability is required that after $i$ sets of
drawings all $n$ of the numbers will have been seen, then, writing
$t$ for the dummy variable to be put equal to zero after differencing,

$$P(n, r, i, n) = \frac{\Delta^n\bigl(t(t-1)\ldots(t-r+1)\bigr)^{i}}{\bigl(n(n-1)\ldots(n-r+1)\bigr)^{i}}.$$

COROLLARY. If the number drawn on each occasion is 1, i.e.
if $r = 1$, then the probability that after $i$ drawings of one number
all $n$ of the numbers will have been seen is

$$P(n, 1, i, n) = \frac{\Delta^n(0)^i}{n^i}.$$
Example. A court of discipline is drawn from 4 members, 2 of
whom are chosen at random for any one sitting. What is the chance
that in six sittings all four will have served?

Here $n = 4$, $r = 2$, $i = 6$, so that

$$\bigl(n(n-1)\bigr)^{i} = (4 \cdot 3)^6 = 12^6,$$

$$\Delta^4\bigl(t(t-1)\bigr)^6 = \Delta^4(0)^{12} - 6\Delta^4(0)^{11} + 15\Delta^4(0)^{10} - 20\Delta^4(0)^{9} + 15\Delta^4(0)^{8} - 6\Delta^4(0)^{7} + \Delta^4(0)^{6},$$

which may be evaluated from either Stevens' tables or those of
Pearson and Elderton. The required probability is 0.94.
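
Where tables are not to hand the differencing may be carried out directly. The sketch below assumes that each set of $r$ numbers is drawn without replacement within the set and that every set of $r$ is equally likely; it forms the $n$th leading difference, at zero, of the number of sets of $r$ that can be drawn from $t$ numbers, raised to the power $i$. The function name is illustrative.

```python
from fractions import Fraction
from math import comb


def prob_all_seen(n, r, i):
    """Chance that all n numbers have been seen after i sets of r drawings."""
    numerator = Fraction(0)
    for j in range(n + 1):
        # term of the n-th leading difference of comb(t, r)**i taken at t = 0
        numerator += (-1) ** (n - j) * comb(n, j) * Fraction(comb(j, r)) ** i
    return numerator / Fraction(comb(n, r)) ** i
```

For the court of discipline `prob_all_seen(4, 2, 6)` is 7291/7776, or 0.94 to two places. With $r = 1$ the same function gives $\Delta^n(0)^i / n^i$.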

Example. The committee of a learned society is 12 in number.
One member retires each month and is replaced by a new person.
If the retiring member is chosen randomly, what is the probability
that after 12 months have passed, none of the members will then
have served 12 months?

We require

$$P(12, 1, 12, 12) = \frac{\Delta^{12}(0)^{12}}{12^{12}} = \frac{12!}{12^{12}}.$$

The evaluation of a probability involving $\Delta^r(0)^s$ is not easy when $r$
and $s$ are both large. The tables of the difference quotients of
zero extend, as we have already pointed out, to $r$ and $s = 25$
only. After these limits have been reached it becomes necessary
to use an approximation to these differences if the probability is
to be evaluated.

Karl Pearson discusses two such approximations, one due to 
Laplace and one due to De Moivre, and remarks that De Moivre's 
approximation may be preferred over that of Laplace in that 
fewer approximations are involved and the formula is on the 
whole easier of application. The problem would appear to be the 
replacement of the series

$$1 - r\left(1-\frac{1}{r}\right)^{s} + \frac{r(r-1)}{2!}\left(1-\frac{2}{r}\right)^{s} - \frac{r(r-1)(r-2)}{3!}\left(1-\frac{3}{r}\right)^{s} + \ldots$$

by one which is easily summable. If we write



as did De Moivre, then



For r large this approximation is adequate and enables the 
required probability to be calculated if the values of r and s 
fall outside the existing tables. For r small the Bernoulli 
polynomials of order r may be used. 

REFERENCES AND READING 

Discussion of the multinomial theorem is often very restricted in
probability text-books. There is usually much more in text-books of
statistical theory and the student should read the derivation of the $\chi^2$
distribution from the multinomial distribution.

Enough has been given here for the student to understand what is
meant by a difference quotient of zero. For those who wish to take the
subject a little further there is L. M. Milne-Thompson, Calculus of
Finite Differences. A knowledge of this calculus is useful for many
problems in combinatorial analysis.

Applications of difference quotients of zero to probability problems
will be found, among many other places, in P. S. Laplace, Théorie des
Probabilités (1812), Livre II, chap. II; K. Pearson, Introduction to
Tables for Statisticians and Biometricians, Part II; W. L. Stevens,
Ann. Eugen. VIII, p. 57, 'Significance of Grouping'.

For further reading in the theory of combinatorial analysis the student
might begin with P. A. MacMahon, Elements of Combinatorial Analysis.



CHAPTER X 

RANDOM VARIABLES. ELEMENTARY 
LAWS AND THEOREMS 

During the preceding chapters attention has been confined to 
discontinuous or discrete probabilities. This restriction of the 
field is purposeful in that in outlining a new subject it is simpler 
for the reader to understand if the sets of points which are 
discussed are denumerable. All fundamental theorems, however, 
relating to the addition and multiplication of probabilities do 
not state explicitly that the discontinuous case is being con- 
sidered (except in the proof) and these theorems will be found to 
apply for the case where there is continuity or perhaps where 
there is a compound of continuity and discontinuity. 

The distinction between discontinuity and continuity will be 
preserved in discussing random variables. It is comparatively 
easy to prove all theorems relating to random variables when the 
variable is discontinuous. When the variable is continuous the 
same theorems may be shown to be true using the theory of sets. 
For the person interested primarily in statistical applications, 
however, it is often sufficient to prove the theorem for the dis- 
continuous case and to see intuitively that for the continuous 
variable the substitution of an integral for a summation sign will 
generalize the theorem. 

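DEFINITION. $x$ is a random variable if, whatever the number $a$,
there exists a probability that $x$ is less than or equal to $a$, i.e.
if $P\{x \le a\}$ exists.

This is quite a general definition. Consider the case of a binomial
probability.

$$P\{k \le k'\} = \sum_{k=0}^{[k']} \frac{n!}{k!\,(n-k)!}\,p^k q^{n-k}.*$$

The probability that $k \le k'$ exists and $k$ is therefore a random
variable. If $x$ is normally distributed, i.e. if

$$P\{x \le a\} = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{a} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right] dx,$$

then $x$ is a random variable, normally distributed.

* $[k']$ = the largest integer not greater than $k'$.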
DEFINITION. The elementary probability law of the discontinuous
random variable $x$ is a function the value of which
corresponding to any given value, say $x = x'$, is the probability
that $x$ takes the value $x'$, i.e.

$$p_x(x') = P\{x = x'\}.$$

When the probability law of $x$ is discontinuous $x$ will be referred
to as a discontinuous random variable and when continuous as
a continuous random variable.

DEFINITION. The integral probability law of a continuous
random variable $x$ is a function $F(x)$ having the property that

$$F(x) = P\{\alpha < x < \beta\} = \int_{\alpha}^{\beta} p(x)\,dx,$$

where $\alpha$ and $\beta$ are any two numbers. $p(x)$ is sometimes called
the elementary probability law of the continuous variable $x$ and
sometimes its frequency function.

Example.

$$P_{n,k} = \frac{n!}{k!\,(n-k)!}\,p^k q^{n-k}$$

is the elementary probability law of a discontinuous random
binomial variable $k$.

Example.

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right]$$

may be spoken of as the elementary probability law of a continuous
random normal variable $x$. Its integral probability law
will be

$$\frac{1}{\sigma\sqrt{2\pi}}\int_{\alpha}^{\beta} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right] dx.$$

DEFINITION. Assume that $x$ is a discontinuous random variable
which may take the mutually exclusive and only possible values
$u_1, u_2, \ldots, u_m$. Let the elementary probability law of $x$ be written
$p_x(u_i)$ for $i = 1, 2, \ldots, m$. Then the mean value of $x$ in repeated
sampling or, in other words, the expectation of $x$* is defined as

$$\mathcal{E}(x) = \sum_{i=1}^{m} u_i\,p_x(u_i).$$

* I do not like the growing English practice of writing the expectation
of $x$ as $E(x)$. $E$ has passed into common use as a linear difference
operator in the calculus of finite differences and there exists the possibility
of confusion if the same symbol is used for expectation. I have followed
the continental practice in using $\mathcal{E}$.
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ILLUSTRATION, k is a discontinuous random variable which 
may take values 0, 1,2, ...,n with corresponding probabilities 
P n . >P n tl , ...,P n . r , ...,P n<n . What is the expectation of fc? 



^ f 

' 



COROLLARY. If A; is a discontinuous random variable, so is k 2 
or any power of k, and the expectation of k 2 is 






DEFINITION. The expectation of a continuous random variable
x is defined as

    \mathcal{E}(x) = \int_{-\infty}^{+\infty} x \, p(x)\,dx,

where p(x) is the elementary probability law of x as defined.

COROLLARY. If p(x) is the elementary probability law of
a continuous random variable x, then the expectation of a
continuous function of x, say f(x), will be

    \mathcal{E}(f(x)) = \int_{-\infty}^{+\infty} f(x) \, p(x)\,dx.

It is clear that the expectation of a function cannot always be
evaluated. For example, consider the simple probability law

    p(x) = 1  for 0 \le x \le 1,
         = 0  for x outside these limits,

and find the expectation of 1/x.

    \mathcal{E}\left(\frac{1}{x}\right) = \int_0^1 \frac{1}{x} \cdot 1 \cdot dx = \left[\log x\right]_0^1

and this is infinite at the lower limit. Or again, suppose that the
elementary probability law of the discontinuous random variable
k is

    p(k) = \frac{e^{-1}}{k!}  for k = 0, 1, 2, ..., +\infty



which cannot be evaluated. 

THEOREM. If x is a discontinuous random variable which may
take values in ascending order of magnitude u_1, u_2, ..., u_m (these
values being mutually exclusive and the only possible ones),
with corresponding probabilities p_1, p_2, ..., p_m, then

    u_1 \le \mathcal{E}(x) \le u_m.

It is given that u_1 \le u_2 \le ... \le u_m.

By definition

    \mathcal{E}(x) = \sum_{i=1}^{m} u_i p_i \ge u_1 \sum_{i=1}^{m} p_i = u_1,

    \mathcal{E}(x) = \sum_{i=1}^{m} u_i p_i \le u_m \sum_{i=1}^{m} p_i = u_m.

Hence

    u_1 \le \mathcal{E}(x) \le u_m

and it follows that the expectation of x must lie between its
greatest and its least values.

THEOREM. If x is a continuous random variable whose
elementary probability law is p(x), then the expectation of any
bounded function, f(x), of x exists, and is contained between
the upper and lower bounds of the function.

A function f(x) is said to be bounded if there exist two numbers
m and M such that

    m \le f(x) \le M.

By definition

    \mathcal{E}(f(x)) = \int_{-\infty}^{+\infty} f(x)\,p(x)\,dx \le M \int_{-\infty}^{+\infty} p(x)\,dx = M,

    \mathcal{E}(f(x)) = \int_{-\infty}^{+\infty} f(x)\,p(x)\,dx \ge m \int_{-\infty}^{+\infty} p(x)\,dx = m.

Hence

    m \le \mathcal{E}(f(x)) \le M.

It has been tacitly assumed that f(x) is real. If f(x) is a complex
function then it may be said to be bounded if there is a number
M such that the modulus of this function does not exceed M.
The expectation of the real and imaginary parts of the function
may each be demonstrated to exist.

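Example. The integral probability law of a random variable x
is the normal law

    F(x) = P\{\alpha < x < \beta\} = \frac{1}{\sigma\sqrt{2\pi}} \int_{\alpha}^{\beta} \exp\left[-\frac{1}{2}\left(\frac{x-\xi}{\sigma}\right)^2\right] dx.

Find

    (i) \mathcal{E}(x),  (ii) \mathcal{E}(x^2).

The mean value in repeated sampling of a random variable x
which follows the normal probability law is thus seen to be the
mean of the normal curve, and the expectation of its square is
the square of that mean increased by \sigma^2.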



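Example. The integral probability law of a random variable
x is

    F(x) = P\{\alpha < x < \beta\} = \frac{1}{\Gamma(l)} \int_{\alpha}^{\beta} x^{l-1} e^{-x}\,dx  for 0 < \alpha \le \beta < +\infty,

where l is an integer. Find the expectation of x^k.

    \mathcal{E}(x^k) = \frac{1}{\Gamma(l)} \int_0^{\infty} x^{k+l-1} e^{-x}\,dx = \frac{\Gamma(l+k)}{\Gamma(l)} = l(l+1)\cdots(l+k-1),

provided k is an integer, remembering the relationship

    \Gamma(l+1) = l\,\Gamma(l).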



DEFINITION. The relative probability law of a discontinuous
random variable x_1, relative to other random variables

    x_2, x_3, ..., x_k,

is a function the value of which for x_1 = u, given

    x_2 = v, x_3 = g, ..., x_k = w,

will be the relative probability that x_1 = u, given

    x_2 = v, x_3 = g, ..., x_k = w,

i.e.

    p_{x_1|x_2,...,x_k}(u \mid v, g, ..., w) = P\{(x_1 = u) \mid (x_2 = v), (x_3 = g), ..., (x_k = w)\}.

For two variables this reduces to

    p_{x_1|x_2}(u \mid v) = P\{(x_1 = u) \mid (x_2 = v)\}.

This definition may simply (and obviously) be extended to meet
the case of the continuous random variable.



DEFINITION. If the random variables x_1 and x_2 are independent,
then

    p_{x_1|x_2}(u \mid v) = P\{(x_1 = u) \mid (x_2 = v)\} = P\{(x_1 = u)\} = p_{x_1}(u).

THEOREM. The expectation of the sum of two discontinuous
random variables is the sum of their expectations, whether the
variables are independent or not.

Consider two random variables x and y. x may take values
u_1, u_2, ..., u_n, which are the only possible, with probabilities
p_1, p_2, ..., p_n. y may take values v_1, v_2, ..., v_m, which are the only
possible, with probabilities p'_1, p'_2, ..., p'_m. Let P_{ij} be the joint
probability that x takes the value u_i while y takes the value v_j,
i.e. let

    P_{ij} = P\{(x = u_i)(y = v_j)\},

then

    \mathcal{E}(x + y) = \sum_{i=1}^{n}\sum_{j=1}^{m} P_{ij}(u_i + v_j) = \sum_{i=1}^{n}\sum_{j=1}^{m} P_{ij} u_i + \sum_{i=1}^{n}\sum_{j=1}^{m} P_{ij} v_j.

The order of summation is quite arbitrary and we may write
therefore

    \mathcal{E}(x + y) = \sum_{i=1}^{n} u_i \sum_{j=1}^{m} P_{ij} + \sum_{j=1}^{m} v_j \sum_{i=1}^{n} P_{ij}.

Now since P_{ij} is the probability that x takes the value u_i while
y takes the value v_j, \sum_{j=1}^{m} P_{ij} will be the probability that x takes
the value u_i while y takes any of the values v_1, v_2, ..., v_m. Hence

    \sum_{j=1}^{m} P_{ij} = p_i,   \sum_{i=1}^{n} P_{ij} = p'_j.

It follows that

    \mathcal{E}(x + y) = \sum_{i=1}^{n} u_i p_i + \sum_{j=1}^{m} v_j p'_j = \mathcal{E}(x) + \mathcal{E}(y)

and the theorem is proved.
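
The theorem may be checked numerically. The following is a minimal sketch, not part of the original argument: the joint law `joint` is a hypothetical one chosen so that x and y are plainly dependent, and exact fractions are used so that the comparison is free of rounding.

```python
from fractions import Fraction as F

# Hypothetical joint law P_ij of two dependent variables x and y,
# keyed by the pair of values (u_i, v_j).
joint = {(0, 0): F(1, 2), (0, 1): F(1, 10), (1, 0): F(1, 10), (1, 1): F(3, 10)}

def expect(f):
    """Expectation of f(x, y) under the joint law."""
    return sum(p * f(u, v) for (u, v), p in joint.items())

e_x = expect(lambda u, v: u)
e_y = expect(lambda u, v: v)
e_sum = expect(lambda u, v: u + v)
e_prod = expect(lambda u, v: u * v)

print(e_sum == e_x + e_y)    # the sum rule holds despite the dependence
print(e_prod == e_x * e_y)   # the product rule does not hold here
```

The second comparison is included only to show that the dependence is real: for this joint law the expectation of the product differs from the product of the expectations.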

THEOREM. The expectation of the sum of k discontinuous
random variables is equal to the sum of their expectations.

Let the k discontinuous random variables be x_1, x_2, ..., x_k. By
the preceding theorem

    \mathcal{E}\left(\sum_{i=1}^{k} x_i\right) = \mathcal{E}(x_1 + x_2 + ... + x_k) = \mathcal{E}(x_1) + \mathcal{E}(x_2 + x_3 + ... + x_k)

and therefore by continued application of the theorem

    \mathcal{E}\left(\sum_{i=1}^{k} x_i\right) = \sum_{i=1}^{k} \mathcal{E}(x_i).

Example. Assume that there are k random variables

    x_1, x_2, ..., x_k

and that

    P\{x_t = r\} = \frac{e^{-m_t} m_t^r}{r!}  for r = 0, 1, 2, ..., +\infty, and t = 1, 2, ..., k.

Find

    \mathcal{E}\left(\sum_{t=1}^{k} x_t\right).

For any x_t we have

    \mathcal{E}(x_t) = \sum_{r=0}^{\infty} r \, \frac{e^{-m_t} m_t^r}{r!} = m_t.

By the preceding theorem

    \mathcal{E}\left(\sum_{t=1}^{k} x_t\right) = \sum_{t=1}^{k} \mathcal{E}(x_t) = \sum_{t=1}^{k} m_t.

The theorems regarding the sum of k random variables can be
proved to hold for either continuous or discontinuous variables.
The proof of the theorem will be assumed for the former case.

THEOREM. x and y are two discontinuous random variables. If
x is independent of y then y is independent of x.

Let the joint probability law of x and y be p_{xy},

    p_{xy}(u, v) = P\{(x = u)(y = v)\} = P\{(x = u)\} P\{(y = v) \mid (x = u)\}
                 = P\{(y = v)\} P\{(x = u) \mid (y = v)\}.

x is given independent of y. It follows by definition that

    P\{(y = v)\} P\{(x = u) \mid (y = v)\} = P\{(y = v)\} P\{(x = u)\}

and hence, from the expansion of the joint probability law, that

    P\{(y = v)\} = P\{(y = v) \mid (x = u)\}.

If x is independent of y then it follows that y must be independent
of x.

THEOREM. The expectation of the product of two discontinuous
random variables is equal to the product of their expectations
if the variables are independent.

Let the two random independent variables be x and y. Let the
only possible values for x be u_1, u_2, ..., u_n, with corresponding
probabilities p_1, p_2, ..., p_n, and for y, v_1, v_2, ..., v_m, with
corresponding probabilities p'_1, p'_2, ..., p'_m. Let P_{ij} be the probability
that x takes the value u_i while y takes the value v_j.

Then

    \mathcal{E}(x \cdot y) = \sum_{i=1}^{n}\sum_{j=1}^{m} u_i v_j P_{ij}.

Now

    P_{ij} = P\{(x = u_i)(y = v_j)\} = P\{(x = u_i)\} P\{(y = v_j) \mid (x = u_i)\}

and because x and y are given independent it must be that

    P_{ij} = P\{(x = u_i)\} P\{(y = v_j)\} = p_i p'_j.

Substituting in the expression for the expectation of (x.y) we
have

    \mathcal{E}(x \cdot y) = \sum_{i=1}^{n}\sum_{j=1}^{m} u_i v_j p_i p'_j = \sum_{i=1}^{n} u_i p_i \sum_{j=1}^{m} v_j p'_j = \mathcal{E}(x) \cdot \mathcal{E}(y).

Thus if the two random variables are independent, the
expectation of their product is equal to the product of their
expectations.
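
A short numerical illustration of the product theorem may be helpful. It is a sketch only: the two marginal laws `px` and `py` are hypothetical, and the joint law is built as P_ij = p_i p'_j, which is exactly the independence assumption of the theorem.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical marginal laws, written as value -> probability.
px = {1: F(1, 4), 2: F(1, 2), 5: F(1, 4)}
py = {-1: F(1, 3), 3: F(2, 3)}

# Independence: the joint probability is the product of the marginals.
joint = {(u, v): px[u] * py[v] for u, v in product(px, py)}

e_x = sum(u * p for u, p in px.items())
e_y = sum(v * p for v, p in py.items())
e_xy = sum(u * v * p for (u, v), p in joint.items())

print(e_x, e_y, e_xy, e_xy == e_x * e_y)
```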

These two theorems, regarding the expectation of the sum and of
the product of two random variables, will hold good whether the
variables are continuous or discontinuous. The theorem regarding
the product may be extended, as before, to cover the case of k
random independent variables.

THEOREM. If x_1, x_2, ..., x_k are k independent random variables,
then the expectation of their product is equal to the product of
their expectations.

By repeated application of the theorem for two variables it is
seen that

    \mathcal{E}\left(\prod_{i=1}^{k} x_i\right) = \mathcal{E}(x_1) \cdot \mathcal{E}\left(\prod_{i=2}^{k} x_i\right) = \prod_{i=1}^{k} \mathcal{E}(x_i)

and the theorem is proved.

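DEFINITION. The standard error of a random variable x is
defined as

    \sigma_x = \left\{\mathcal{E}\left(x - \mathcal{E}(x)\right)^2\right\}^{1/2}.

Example. If k is a random variable having as elementary
probability law the binomial law of probabilities, what is its
standard error?

It has been found previously that

    \mathcal{E}(k) = np,   \mathcal{E}(k^2) = n(n-1)p^2 + np,

where n and p have their usual meanings. By definition

    \sigma_k^2 = \mathcal{E}\left(k - \mathcal{E}(k)\right)^2 = \mathcal{E}(k^2) - \left(\mathcal{E}(k)\right)^2

and hence

    \sigma_k^2 = np(1 - p)

as found in chapter iv.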
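
The two expectations quoted in this example, and the resulting (standard error)², can be confirmed by direct summation over the binomial law. The sketch below is an illustration only; the function name `binomial_moments` and the values n = 10, p = 1/3 are assumptions made for the purpose.

```python
from fractions import Fraction as F
from math import comb

def binomial_moments(n, p):
    """Return (E(k), E(k^2)) for the binomial law by direct summation."""
    q = 1 - p
    law = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
    e1 = sum(k * pk for k, pk in enumerate(law))
    e2 = sum(k * k * pk for k, pk in enumerate(law))
    return e1, e2

n, p = 10, F(1, 3)
e1, e2 = binomial_moments(n, p)
variance = e2 - e1**2          # (standard error)^2 of k

print(e1, e2, variance)
```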
Example. n random independent variables x_1, x_2, ..., x_n have
the same probability law about which nothing is known except
that the first two moments exist, viz.

    \mathcal{E}(x_i) = \xi,   \sigma_{x_i}^2 = \sigma^2.

Find the expectation of the mean of the x's and its standard
error.

For any x it is given that

    \mathcal{E}(x) = \xi.

Let

    \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.

Then

    \mathcal{E}(\bar{x}) = \mathcal{E}\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{E}(x_i) = \frac{1}{n}\sum_{i=1}^{n} \xi = \xi.

Thus the expectation of the mean of a sample, that is the mean
value of the sample mean in repeated sampling, is equal to the
population mean. Similarly

    \sigma_{\bar{x}}^2 = \mathcal{E}(\bar{x} - \xi)^2 = \frac{1}{n^2}\left[\sum_{i=1}^{n} \mathcal{E}(x_i - \xi)^2 + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \mathcal{E}(x_i - \xi)(x_j - \xi)\right].

It will be noted that in finding the expectation of the mean of
the n variables no use was made of the fact that the variables
were given independent. Thus the fact that the expectation of
the sample mean is the population mean is unaltered by
dependence between the x's. The same is not true for the standard
error of the mean because the cross-products of the right-hand
side of the expression immediately above can vanish only if
the variables are independent. Given that the variables are
independent it follows that

    \sigma_{\bar{x}}^2 = \frac{1}{n^2}\sum_{i=1}^{n} \mathcal{E}(x_i - \xi)^2 = \frac{1}{n^2} \cdot n\sigma^2.

Hence the (standard error)^2 of the mean of a sample of n is

    \sigma_{\bar{x}}^2 = \frac{\sigma^2}{n},

a fact which is well known.

This result is true for any n independent variables possessing
the same probability law and for which

    \mathcal{E}(x_i) = \xi  and  \sigma_{x_i}^2 = \sigma^2,

although it is most commonly made use of in the normal case.
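
DEFINITION. The correlation coefficient, \rho_{ij}, between any two
random variables x_i and x_j is defined as

    \rho_{ij} = \frac{\mathcal{E}\left[(x_i - \mathcal{E}(x_i))(x_j - \mathcal{E}(x_j))\right]}{\sigma_{x_i}\sigma_{x_j}},

where \sigma_{x_i} and \sigma_{x_j} are the standard errors of x_i and x_j as previously
defined.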

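THEOREM. The standard error of any linear function

    y = \alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n

of n random variables x_1, x_2, ..., x_n is given by

    \sigma_y^2 = \sum_{i=1}^{n} \alpha_i^2\sigma_i^2 + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \alpha_i\alpha_j\sigma_i\sigma_j\rho_{ij},

where the \alpha's are constants, \sigma_i and \sigma_j are the standard errors of
x_i and x_j respectively, and \rho_{ij} is the correlation coefficient as
defined above.

This theorem will be true for both discontinuous and continuous
random variables.

Let \mathcal{E}(x_i) = a_i for i = 1, 2, ..., n. Then

    \mathcal{E}(y) = \mathcal{E}\left(\sum_{i=1}^{n} \alpha_i x_i\right) = \sum_{i=1}^{n} \alpha_i a_i.

By definition

    \sigma_y^2 = \mathcal{E}\left(y - \mathcal{E}(y)\right)^2 = \mathcal{E}\left[\sum_{i=1}^{n} \alpha_i(x_i - a_i)\right]^2,

from which, by expanding the bracket,

    \sigma_y^2 = \mathcal{E}\left[\sum_{i=1}^{n} \alpha_i^2(x_i - a_i)^2 + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \alpha_i\alpha_j(x_i - a_i)(x_j - a_j)\right].

Applying the theorem that the expectation of a sum equals the
sum of expectations and remembering the definition of the
correlation coefficient we have

    \sigma_y^2 = \sum_{i=1}^{n} \alpha_i^2\sigma_i^2 + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \alpha_i\alpha_j\sigma_i\sigma_j\rho_{ij},

which proves the theorem.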
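
The identity can be verified on a small case. In the sketch below, which is an illustration and not part of the proof, the joint law of three dependent variables and the constants `alpha` are hypothetical. The product ρ_ij σ_i σ_j is computed directly as the expectation of the cross-product (held in `cov[i][j]`), so that no square roots are needed and the arithmetic stays exact.

```python
from fractions import Fraction as F

# Hypothetical joint law of (x1, x2, x3); the variables are dependent.
joint = {(0, 0, 1): F(1, 4), (1, 0, 0): F(1, 4),
         (1, 2, 1): F(1, 4), (2, 2, 3): F(1, 4)}
alpha = [F(2), F(-1), F(1, 2)]      # arbitrary constants
n = len(alpha)

def expect(f):
    """Expectation of f(x1, x2, x3) under the joint law."""
    return sum(p * f(xs) for xs, p in joint.items())

def y(xs):
    """The linear function y = sum of alpha_i * x_i."""
    return sum(al * x for al, x in zip(alpha, xs))

a = [expect(lambda xs, i=i: xs[i]) for i in range(n)]
# cov[i][j] = E[(x_i - a_i)(x_j - a_j)] = rho_ij * sigma_i * sigma_j
cov = [[expect(lambda xs, i=i, j=j: (xs[i] - a[i]) * (xs[j] - a[j]))
        for j in range(n)] for i in range(n)]

e_y = expect(y)
direct = expect(lambda xs: (y(xs) - e_y) ** 2)       # left-hand side
formula = (sum(alpha[i] ** 2 * cov[i][i] for i in range(n))
           + 2 * sum(alpha[i] * alpha[j] * cov[i][j]
                     for i in range(n - 1) for j in range(i + 1, n)))

print(direct, formula, direct == formula)
```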

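Example. If \alpha_1 = \alpha_2 = ... = \alpha_n = 1/n, a_1 = a_2 = ... = a_n = a, and
if \sigma_1 = \sigma_2 = ... = \sigma_n = \sigma, then y = \bar{x}, \mathcal{E}(y) = a and

    \sigma_{\bar{x}}^2 = \frac{\sigma^2}{n} + \frac{2\sigma^2}{n^2}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \rho_{ij}.

If, further, the variables are assumed independent then

    \sigma_{\bar{x}}^2 = \sigma^2/n

as found in a previous example.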

Example. Find the standard error of the sum and the difference
of two random variables (i) if they are dependent, (ii) if they are
independent.

For the sum of two variables let

    \alpha_3 = \alpha_4 = ... = \alpha_n = 0,  \alpha_1 = \alpha_2 = 1.

Then

    y = x_1 + x_2

and

    \sigma_y^2 = \sigma_1^2 + \sigma_2^2 + 2\sigma_1\sigma_2\rho_{12}.

If x_1 and x_2 are independent then

    \sigma_y^2 = \sigma_1^2 + \sigma_2^2.

For the difference of two variables let

    \alpha_3 = \alpha_4 = ... = \alpha_n = 0,  \alpha_1 = 1,  \alpha_2 = -1.

Then

    y = x_1 - x_2

and

    \sigma_y^2 = \sigma_1^2 + \sigma_2^2 - 2\sigma_1\sigma_2\rho_{12}.

If x_1 and x_2 are independent then

    \sigma_y^2 = \sigma_1^2 + \sigma_2^2.

Exercise. Find the standard error of the sum of three random
variables, x_1 + x_2 + x_3, if the variables are dependent and show
how this simplifies if they are assumed independent.

Example. A bag contains N balls. Each ball has a number
stamped on it, the numbers being u_1, u_2, ..., u_N. It may be
assumed that it is equiprobable that any one ball will be drawn
and the probability of drawing u_i is 1/N for i = 1, 2, ..., N. From
this bag n balls are drawn and no ball is replaced after drawing.
What is the standard error of the mean of the n numbers thus
drawn?

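Let the numbers drawn be x_1, x_2, ..., x_n. It is required therefore
to find \sigma_{\bar{x}}, where

    \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.

The expectation of any given number will be

    \mathcal{E}(x_i) = u_1 \cdot \frac{1}{N} + u_2 \cdot \frac{1}{N} + ... + u_N \cdot \frac{1}{N} = \frac{1}{N}\sum_{t=1}^{N} u_t = \bar{u} (say).

Hence the expectation of \bar{x} will be

    \mathcal{E}(\bar{x}) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{E}(x_i) = \bar{u}.

This result is what might have been expected. The standard
error of \bar{x} is not, however, so easily intuitive. Consider first the
standard error, \sigma_i, of x_i.

From the definition of an expectation it follows that

    \sigma_i^2 = \mathcal{E}(x_i - \bar{u})^2 = \frac{1}{N}\sum_{t=1}^{N} (u_t - \bar{u})^2 = V_u (say).

V_u, the variance of the u's, is constant. From the theorem on the
standard error of a linear function we may write immediately

    \sigma_{\bar{x}}^2 = \frac{1}{n^2}\left[\sum_{i=1}^{n} \sigma_i^2 + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \rho_{ij}\sigma_i\sigma_j\right].

The first summation on the right-hand side may be evaluated
but it remains to calculate the double summation.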



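    \rho_{ij}\sigma_i\sigma_j = \mathcal{E}\left[(x_i - \mathcal{E}(x_i))(x_j - \mathcal{E}(x_j))\right]

by definition. Again appealing to the definition of an expectation,
i.e. the summation of all possible values a random variable may
take multiplied by the probability that it takes each separate
value, we have, treating the product as a unit,

    \mathcal{E}\left[(x_i - \mathcal{E}(x_i))(x_j - \mathcal{E}(x_j))\right] = \frac{1}{N(N-1)}\sum_{s=1}^{N}\sum_{t \ne s} (u_s - \bar{u})(u_t - \bar{u}).

The (standard error)^2 of \bar{x} accordingly will be

    \sigma_{\bar{x}}^2 = \frac{V_u}{n} + \frac{2}{n^2}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \frac{1}{N(N-1)}\sum_{s=1}^{N}\sum_{t \ne s} (u_s - \bar{u})(u_t - \bar{u}).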

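This expression may be simplified by the following device. It is
clear that

    \sum_{t=1}^{N} (u_t - \bar{u}) = 0.

Squaring each side

    \left[\sum_{t=1}^{N} (u_t - \bar{u})\right]^2 = \sum_{t=1}^{N} (u_t - \bar{u})^2 + \sum_{s=1}^{N}\sum_{t \ne s} (u_s - \bar{u})(u_t - \bar{u}) = 0

and therefore

    \sum_{s=1}^{N}\sum_{t \ne s} (u_s - \bar{u})(u_t - \bar{u}) = -\sum_{t=1}^{N} (u_t - \bar{u})^2 = -N V_u.

By substitution the (standard error)^2 of \bar{x} will reduce to

    \sigma_{\bar{x}}^2 = \frac{V_u}{n} - \frac{2}{n^2}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \frac{V_u}{N-1}.

This last double summation is of a constant and it is only
necessary therefore to enumerate the number of constants
concerned. This will be n(n-1)/2, whence \sigma_{\bar{x}}^2 becomes

    \sigma_{\bar{x}}^2 = \frac{V_u}{n}\left(\frac{N-n}{N-1}\right).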

Note (i). When n = 1 the expression for \sigma_{\bar{x}}^2 reduces to the
variance of the u's, which might be expected.

Note (ii). If the x's were independent, that is if each ball had
been replaced after being drawn and its number noted, then

    \sigma_{\bar{x}}^2 = \frac{V_u}{n}

from the expression for the standard error of a linear function.
If this expression is compared with that for the standard error of
the mean when the drawings are made without replacement, it
will be seen that the latter is less, provided n is greater than unity.
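
The result for drawing without replacement, namely that the (standard error)² of the mean is (V_u/n)(N − n)/(N − 1), where V_u is the variance of the numbers in the bag, can be confirmed by enumerating every possible sample. The sketch below does this for a hypothetical bag of five balls; the numbers in `u` and the function name are assumptions made for illustration.

```python
from fractions import Fraction as F
from itertools import combinations

def var_of_mean_without_replacement(u, n):
    """Exact variance of the sample mean over all equally likely samples of n."""
    means = [F(sum(s), n) for s in combinations(u, n)]
    grand = sum(means) / len(means)
    return sum((m - grand) ** 2 for m in means) / len(means)

u = [1, 3, 4, 7, 10]            # hypothetical numbers stamped on the balls
N, n = len(u), 3
u_bar = F(sum(u), N)
V_u = sum((x - u_bar) ** 2 for x in u) / N      # variance of the u's

exact = var_of_mean_without_replacement(u, n)
formula = V_u / n * F(N - n, N - 1)

print(exact, formula, V_u / n)
```

The last figure printed is the value V_u/n which would apply if each ball were replaced after drawing; it exceeds the other two, as stated in Note (ii).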
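THEOREM. Given that (i) x_1, x_2, ..., x_n are n random
independent variables, (ii) \mathcal{E}(x_i) = a_i for i = 1, 2, ..., n,
(iii) \mathcal{E}(x_i - a_i)^2 = \sigma_i^2 for i = 1, 2, ..., n, then

    \mathcal{E}\left[\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \frac{n-1}{n^2}\sum_{i=1}^{n} \sigma_i^2 + \frac{1}{n}\sum_{i=1}^{n} (a_i - \bar{a})^2,

where \bar{a} = \mathcal{E}(\bar{x}) and \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.

Possibly the simplest method of attack for this and similar
problems is to employ the device of inserting the expectation of
each variable within the bracket. Neglecting the factor 1/n for
the time being, write

    \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} \left[(x_i - a_i) - (\bar{x} - \bar{a}) + (a_i - \bar{a})\right]^2.

The usefulness of this device is clear. In the expansion of the
bracket a number of cross-products will appear. Provided x_i and
x_j are independent, and they are so given, then

    \mathcal{E}\left[(x_i - a_i)(x_j - a_j)\right] = \mathcal{E}(x_i - a_i) \cdot \mathcal{E}(x_j - a_j) = 0

and the algebra is accordingly simplified. The expansion of the
bracket becomes

    \sum_{i=1}^{n} (x_i - a_i)^2 + n(\bar{x} - \bar{a})^2 + \sum_{i=1}^{n} (a_i - \bar{a})^2 - 2\sum_{i=1}^{n} (x_i - a_i)(\bar{x} - \bar{a})
        + 2\sum_{i=1}^{n} (x_i - a_i)(a_i - \bar{a}) - 2(\bar{x} - \bar{a})\sum_{i=1}^{n} (a_i - \bar{a}).

Using the theorem regarding the expectation of a sum it is seen
that the evaluation of the first term is immediate. The third term
is a constant, the expectation of the fifth is zero, and the sixth
vanishes because the sum of the (a_i - \bar{a}) is zero. We need to
consider the second and fourth terms.

    n\,\mathcal{E}(\bar{x} - \bar{a})^2 = \frac{1}{n}\mathcal{E}\left[\sum_{i=1}^{n} (x_i - a_i)\right]^2 = \frac{1}{n}\sum_{i=1}^{n} \sigma_i^2.

Also

    \mathcal{E}\sum_{i=1}^{n} \left[(x_i - a_i)(\bar{x} - \bar{a})\right] = \frac{1}{n}\mathcal{E}\sum_{i=1}^{n} (x_i - a_i)^2 = \frac{1}{n}\sum_{i=1}^{n} \sigma_i^2.

Hence

    \mathcal{E}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} \sigma_i^2 + \frac{1}{n}\sum_{i=1}^{n} \sigma_i^2 - \frac{2}{n}\sum_{i=1}^{n} \sigma_i^2 + \sum_{i=1}^{n} (a_i - \bar{a})^2
        = \frac{n-1}{n}\sum_{i=1}^{n} \sigma_i^2 + \sum_{i=1}^{n} (a_i - \bar{a})^2,

or introducing the factor 1/n previously neglected

    \mathcal{E}\left[\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \frac{n-1}{n^2}\sum_{i=1}^{n} \sigma_i^2 + \frac{1}{n}\sum_{i=1}^{n} (a_i - \bar{a})^2.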

COROLLARY. Suppose the n random independent variables x_i
to follow the same probability law with mean a and standard
deviation \sigma. This might be the case for a sample of n individuals
which had been randomly and independently drawn from such
a population. In this case

    \mathcal{E}(x_i) = a_i = a,   \mathcal{E}(x_i - a_i)^2 = \sigma_i^2 = \sigma^2   for i = 1, 2, ..., n,

and the theorem reduces to

    \mathcal{E}\left[\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \frac{n-1}{n}\sigma^2.

Now \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 will be recognized as the sample (standard
deviation)^2, s^2. It follows then that the mean value in repeated
sampling of the square of the sample standard deviation is not
equal to the square of the population standard deviation, but
in fact

    \mathcal{E}(s^2) = \frac{n-1}{n}\sigma^2.

If, therefore, the sample standard deviation is used as an estimate
of the population standard deviation, in the long run the tendency
will be to underestimate it. If it is desired to obtain a sum of
squares which in repeated sampling will average out to be \sigma^2, then
it is clear from the equation

    \mathcal{E}\left[\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \sigma^2

that the factor 1/(n-1) should replace the 1/n of the sample
(standard deviation)^2. This new expression is not a sample
(standard deviation)^2 nor a population (standard deviation)^2; it
is an expression which, averaged over a series of experiments, will
approximate to the population (standard deviation)^2.

This elementary piece of algebra supplies the answer to the
confused question 'Do I divide by n or n - 1 to obtain the
standard deviation?' If a measure of the scatter in the sample is
required then the sample standard deviation

    s = \left[\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{1/2}

must be calculated. If it is desired to estimate the population
(standard deviation)^2 then the expression s^2 (n/(n-1)) may be
calculated because in the long run it will be equal to \sigma^2.
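
The statement that the sample (standard deviation)² with divisor n averages to (n − 1)σ²/n in repeated sampling, while the same sum of squares with divisor n − 1 averages to σ², may be verified without any sampling experiment by enumerating all equally likely samples from a small population. The following sketch assumes, purely for illustration, that the population is a single fair die and that n = 3.

```python
from fractions import Fraction as F
from itertools import product

faces = [1, 2, 3, 4, 5, 6]      # assumed population: one fair die
n = 3
mu = F(sum(faces), 6)
sigma2 = sum((f - mu) ** 2 for f in faces) / 6      # population variance

def s2(sample):
    """Sample (standard deviation)^2 with divisor n."""
    m = F(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample) / len(sample)

samples = list(product(faces, repeat=n))            # all 6^n equally likely samples
mean_s2 = sum(s2(s) for s in samples) / len(samples)
mean_corrected = mean_s2 * F(n, n - 1)              # divisor n - 1 instead of n

print(sigma2, mean_s2, mean_corrected)
```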

Exercise. Given (i) n independent random variables each of
which has the same probability law,

    (ii) \mathcal{E}(x_i) = a,   (iii) \mathcal{E}(x_i - a)^2 = \sigma^2,

    (iv) s^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2,

calculate

    \sigma_{s^2}^2 = \mathcal{E}\left(s^2 - \mathcal{E}(s^2)\right)^2.

[Note: \mathcal{E}(s^2) was found above.]

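Example. (Weldon's dice problem.) n_1 white dice and n_2 red
dice are shaken together and thrown on a table. The sum of the
dots on the upper faces is noted. The red dice are then picked
up and thrown again among the white dice left on the table. The
sum of the dots on the upper faces is again noted. What is the
correlation between the first and second sums?

Let the numbers on the upper faces of the white dice be
t_{11}, t_{12}, ..., t_{1 n_1}, the numbers on the upper faces of the red dice
at the first throw be t_{21}, t_{22}, ..., t_{2 n_2}, and at the second throw
t_{31}, t_{32}, ..., t_{3 n_2}, and let

    t_1 = \sum_{i=1}^{n_1} t_{1i},   t_2 = \sum_{i=1}^{n_2} t_{2i},   t_3 = \sum_{i=1}^{n_2} t_{3i}.

It is required to find the correlation between t_1 + t_2 and t_1 + t_3.
Consider first just one die. If t_{1i} be the number of dots on its
upper face after throwing, then

    \mathcal{E}(t_{1i}) = \frac{1}{6}(1 + 2 + ... + 6) = \frac{7}{2}

and

    \mathcal{E}(t_{1i}^2) = \frac{1}{6}(1^2 + 2^2 + ... + 6^2) = \frac{91}{6}.

Therefore

    \sigma_{t_{1i}}^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12}.

For the sum

    \mathcal{E}(t_1 + t_2) = \frac{7}{2}(n_1 + n_2)

and similarly

    \mathcal{E}(t_1 + t_3) = \frac{7}{2}(n_1 + n_2).

Applying the elementary theorems on expectations, and
remembering that any one die is independent of any other die, it
may be shown that

    \sigma_{t_1+t_2}^2 = \frac{35}{12}(n_1 + n_2) = \sigma_{t_1+t_3}^2.

If \rho is the coefficient of correlation between t_1 + t_2 and t_1 + t_3, then
by definition

    \rho\,\sigma_{t_1+t_2}\sigma_{t_1+t_3} = \mathcal{E}\left[\left(t_1 + t_2 - \mathcal{E}(t_1 + t_2)\right)\left(t_1 + t_3 - \mathcal{E}(t_1 + t_3)\right)\right].

Replacing the individual sums on the right-hand side we shall
have

    \rho\,\sigma_{t_1+t_2}\sigma_{t_1+t_3}
        = \mathcal{E}\left\{\left[\sum_{i=1}^{n_1} (t_{1i} - \tfrac{7}{2}) + \sum_{i=1}^{n_2} (t_{2i} - \tfrac{7}{2})\right]\left[\sum_{i=1}^{n_1} (t_{1i} - \tfrac{7}{2}) + \sum_{i=1}^{n_2} (t_{3i} - \tfrac{7}{2})\right]\right\}
        = \mathcal{E}\left[\sum_{i=1}^{n_1} (t_{1i} - \tfrac{7}{2})\right]^2 = \frac{35}{12} n_1.

The correlation coefficient between the two sums (t_1 + t_2) and
(t_1 + t_3) is therefore

    \rho = \frac{n_1}{n_1 + n_2}.

It may be noted that this is a simple example of a more general
case. If X and Y are two random variables, each composed of the
sum of two random variables, one of which is common to both,

    X = x + t,   Y = y + t,

then there will be a correlation between X and Y.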
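
The value of the correlation in Weldon's dice problem, n_1/(n_1 + n_2) with n_1 white and n_2 red dice, lends itself to a check by complete enumeration when the numbers of dice are small. The sketch below is an illustration under that assumption; the function name `weldon_correlation` is hypothetical. Because the two sums have equal variances, the correlation is obtained as a ratio of exact fractions.

```python
from fractions import Fraction as F
from itertools import product

def weldon_correlation(n1, n2):
    """Exact correlation of t1 + t2 and t1 + t3 by enumerating every throw."""
    total = n1 + 2 * n2            # white dice, red first throw, red second throw
    s_a = s_b = s_aa = s_bb = s_ab = count = 0
    for throw in product(range(1, 7), repeat=total):
        t1 = sum(throw[:n1])
        t2 = sum(throw[n1:n1 + n2])
        t3 = sum(throw[n1 + n2:])
        a, b = t1 + t2, t1 + t3
        s_a += a
        s_b += b
        s_aa += a * a
        s_bb += b * b
        s_ab += a * b
        count += 1
    cov = F(s_ab, count) - F(s_a, count) * F(s_b, count)
    var_a = F(s_aa, count) - F(s_a, count) ** 2
    var_b = F(s_bb, count) - F(s_b, count) ** 2
    assert var_a == var_b          # equal variances, so rho = cov / var_a
    return cov / var_a

rho = weldon_correlation(2, 1)
print(rho)
```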

DEFINITION. If a random variable x may take only the values
zero or unity then x is defined as a characteristic random variable.

If x is a characteristic random variable with probability p
that it takes the value 1, and 1 - p that it takes the value 0, then

    \mathcal{E}(x) = 1 \cdot p + 0 \cdot (1 - p) = p.

A characteristic random variable may be seen, in this way, to
have the interesting property that

    \mathcal{E}(x) = \mathcal{E}(x^2) = ... = \mathcal{E}(x^r)

for

    \mathcal{E}(x^r) = 1^r \cdot p + 0^r \cdot (1 - p) = p

and this will be true whatever p.

Example. Consider a series of trials, n in number, in each of
which the constant probability of a success is p. If with each
of these trials is associated a characteristic random variable x
which will take the value 1 if the trial succeeds and 0 if it fails
then

    \mathcal{E}(nx) = n\,\mathcal{E}(x) = n\left(1 \cdot p + 0 \cdot (1 - p)\right) = np.

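Example on expectations. In the previous chapter we have
discussed the enumeration of the patterns in which k balls will
fall when dropped randomly in a box of N compartments. It is
easy to show by direct argument that the average proportion of
compartments filled in repeated sampling is

    1 - \left(1 - \frac{1}{N}\right)^k.

This result may also be achieved by application of the theorems
of this chapter. The probability that exactly t compartments will
be filled is

    \frac{N!}{t!\,(N-t)!} \cdot \frac{\Delta^t (0)^k}{N^k}.

Hence

    \mathcal{E}\left(\frac{t}{N}\right) = \sum_{t=1}^{N} \frac{t}{N} \cdot \frac{\Delta^t (0)^k}{N^k} \cdot \frac{N!}{t!\,(N-t)!}
        = \frac{1}{N^k}\sum_{t=1}^{N} \Delta \cdot \Delta^{t-1}(0)^k \frac{(N-1)!}{(t-1)!\,(N-t)!}
        = \frac{1}{N^k}\Delta(1 + \Delta)^{N-1}(0)^k.

If we use the linear difference notation and write

    E = 1 + \Delta,

then

    E^{N-1}(0)^k = (N-1)^k *

and

    \mathcal{E}\left(\frac{t}{N}\right) = \frac{1}{N^k}\Delta(N-1)^k = \frac{N^k - (N-1)^k}{N^k} = 1 - \left(1 - \frac{1}{N}\right)^k.

* E x_t is defined as E x_t = x_{t+1}. Hence

    E^{N-1} t^k = E^{N-2}(t+1)^k = E^{N-3}(t+2)^k = ... = (t + N - 1)^k.

Similarly the mean value of (t/N)^2 in repeated sampling will be

    \mathcal{E}\left(\frac{t^2}{N^2}\right) = \sum_{t=1}^{N} \frac{t^2}{N^2} \cdot \frac{N!}{t!\,(N-t)!} \cdot \frac{\Delta^t (0)^k}{N^k}
        = \frac{1}{N^{k+1}}\left[(N-1)\Delta^2(1 + \Delta)^{N-2} + \Delta(1 + \Delta)^{N-1}\right](0)^k
        = \frac{N-1}{N}\left[1 - 2\left(1 - \frac{1}{N}\right)^k + \left(1 - \frac{2}{N}\right)^k\right] + \frac{1}{N}\left[1 - \left(1 - \frac{1}{N}\right)^k\right],

from which it follows that the variance of t/N, say \sigma_{t/N}^2, is

    \sigma_{t/N}^2 = \frac{1}{N}\left(1 - \frac{1}{N}\right)^k + \left(1 - \frac{1}{N}\right)\left(1 - \frac{2}{N}\right)^k - \left(1 - \frac{1}{N}\right)^{2k}.

Exercise. Find the third and fourth moments about the mean
in repeated sampling of the proportion t/N.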
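
For small N and k the mean and variance of the proportion of compartments filled can be obtained by listing all N^k equally likely placements of the balls. The sketch below compares that enumeration with the closed forms 1 − (1 − 1/N)^k for the mean and (1/N)(1 − 1/N)^k + (1 − 1/N)(1 − 2/N)^k − (1 − 1/N)^{2k} for the variance. The variance expression is quoted here as the standard result for this occupancy problem, and the values N = 4, k = 5 and the function name are assumptions made for illustration.

```python
from fractions import Fraction as F
from itertools import product

def occupancy_moments(N, k):
    """Exact mean and variance of the proportion of compartments filled."""
    outcomes = list(product(range(N), repeat=k))    # N^k equally likely placements
    props = [F(len(set(o)), N) for o in outcomes]   # filled compartments / N
    mean = sum(props) / len(props)
    var = sum((x - mean) ** 2 for x in props) / len(props)
    return mean, var

N, k = 4, 5
mean, var = occupancy_moments(N, k)
a = F(N - 1, N) ** k            # (1 - 1/N)^k
b = F(N - 2, N) ** k            # (1 - 2/N)^k

print(mean, 1 - a)
print(var, a / N + (1 - F(1, N)) * b - a ** 2)
```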

REFERENCES AND READING 

Suitable exercises for most of the processes outlined in this chapter
may be found in J. V. Uspensky, Introduction to Mathematical Probability.

Again, W. Whitworth, Choice and Chance, gives a choice of many 
ingenious examples. 



CHAPTER XI 
MOMENTS OF SAMPLING DISTRIBUTIONS 

In the previous chapter there has been set out the mathematical 
technique whereby the expectation of a random variable, or of 
a function of random variables, may be calculated. One of the 
main uses to which the statistician puts this technique is for the 
calculation of the theoretical moments of sampling distributions. 
Such calculations are straightforward and are really only exercises 
on the use of the theorems already proved, but since they are 
of importance we shall consider them here in some detail. The 
connexion between the random variable of the probabilist and 
the sampling unit of the statistician is usually made in the 
following way. A single unit is randomly drawn from some 
population the probability distribution of which may be com- 
pletely known, or may be incompletely specified. With this single 
unit we associate a random variable which has the same prob- 
ability distribution as the parent population; this single unit will 
thus be one observed value of the given random variable. Hence, 
if we randomly draw a sample of units from a given population, 
we may associate a random variable with each element of the 
sample in order of drawing, and to find the mean value in repeated 
sampling of a function of the observed values it will only be 
necessary to discuss the mathematical expectation of the same 
function of the associated random variables. We shall begin by 
finding the moments of the sampling distribution of the means of 
samples the units of which have been randomly and indepen- 
dently drawn from an infinite population or, more precisely, 
from a finite population with replacement after drawing. In this 
latter case the population is effectively infinite for provided each 
unit is returned after drawing it is not possible to exhaust the 
population.

SAMPLING MOMENTS OF THE MEAN:
POPULATION INFINITE

Assume that a sample of n units is randomly and independently
drawn from a population the distribution of which is not specified
but the first four moments of which are known to exist. Let these
population moments be \mu'_1, \mu_2, \mu_3 and \mu_4, where the \mu's have their
usual meaning. Associate with the sample units in order of
drawing n random variables x_1, x_2, ..., x_n, and let

    \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.

It is required to find \mu'_1(\bar{x}), \mu_2(\bar{x}), \mu_3(\bar{x}) and \mu_4(\bar{x}).

It has already been shown in the previous chapter that

    \mu'_1(\bar{x}) = \mathcal{E}(\bar{x}) = \mu'_1

and

    \mu_2(\bar{x}) = \mathcal{E}\left(\bar{x} - \mathcal{E}(\bar{x})\right)^2 = \mu_2/n,

from which, if we apply the usual convention of writing \mu_2 = \sigma^2,
we have that

    \sigma(\bar{x}) = \sigma/\sqrt{n}.

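The third and fourth moments follow in similar fashion but
require a little more enumeration.

    \mu_3(\bar{x}) = \mathcal{E}(\bar{x} - \mu'_1)^3 = \frac{1}{n^3}\mathcal{E}\left[\sum_{i=1}^{n} (x_i - \mu'_1)\right]^3
        = \frac{1}{n^3}\mathcal{E}\left[\sum_{i} (x_i - \mu'_1)^3 + 3\sum_{i \ne j} (x_i - \mu'_1)^2(x_j - \mu'_1)
          + 6\sum_{i<j<l} (x_i - \mu'_1)(x_j - \mu'_1)(x_l - \mu'_1)\right].

The sample units, and hence the random variables, are given
independent. The second and third terms therefore vanish and
we have

    \mu_3(\bar{x}) = \frac{\mu_3}{n^2}.

For \mu_4(\bar{x}) similarly

    \mu_4(\bar{x}) = \frac{1}{n^4}\mathcal{E}\left[\sum_{i} (x_i - \mu'_1)^4 + 4\sum_{i \ne j} (x_i - \mu'_1)^3(x_j - \mu'_1)
          + 6\sum_{i<j} (x_i - \mu'_1)^2(x_j - \mu'_1)^2
          + 12\sum_{i}\sum_{j<l;\ j,l \ne i} (x_i - \mu'_1)^2(x_j - \mu'_1)(x_l - \mu'_1)
          + 24\sum_{i<j<l<m} (x_i - \mu'_1)(x_j - \mu'_1)(x_l - \mu'_1)(x_m - \mu'_1)\right].

Again, because of independence, all the terms except the first and
third vanish and

    \mu_4(\bar{x}) = \frac{1}{n^4}\left[n\mu_4 + 3n(n-1)\mu_2^2\right] = \frac{\mu_4}{n^3} + \frac{3(n-1)}{n^3}\mu_2^2.

The \beta_1(\bar{x}) and \beta_2(\bar{x}) of the distribution of the means will be

    \beta_1(\bar{x}) = \frac{\mu_3^2(\bar{x})}{\mu_2^3(\bar{x})} = \frac{\beta_1}{n}

and

    \beta_2(\bar{x}) = \frac{\mu_4(\bar{x})}{\mu_2^2(\bar{x})} = 3 + \frac{\beta_2 - 3}{n}.

It is clear therefore that whatever the \beta_1 and \beta_2 of the parent
population (provided they exist), as n increases the \beta_1(\bar{x}) and
\beta_2(\bar{x}) will tend to those of the normal population. If the parent
population is known to be normally distributed then \beta_1 = 0 and
\beta_2 = 3 and therefore so do \beta_1(\bar{x}) and \beta_2(\bar{x}).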
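
These sampling moments of the mean can be confirmed exactly for a small discrete population by building the complete law of the mean of n independent drawings. In the sketch below the population law (`values`, `probs`) is a hypothetical skew one and n = 4 is an arbitrary choice; the relations checked are μ_2(x̄) = μ_2/n, μ_3(x̄) = μ_3/n² and μ_4(x̄) = μ_4/n³ + 3(n − 1)μ_2²/n³, which are the standard results for independent drawings.

```python
from fractions import Fraction as F
from itertools import product

values = [0, 1, 5]                        # hypothetical skew population
probs = [F(1, 2), F(1, 3), F(1, 6)]

def central_moments(vals, ps):
    """Return (mean, mu2, mu3, mu4) of a discrete law."""
    m = sum(v * p for v, p in zip(vals, ps))
    return (m,) + tuple(sum((v - m) ** r * p for v, p in zip(vals, ps))
                        for r in (2, 3, 4))

def law_of_mean(n):
    """Exact law of the mean of n independent drawings as (value, probability)."""
    pairs = []
    for idx in product(range(len(values)), repeat=n):
        p = F(1)
        for i in idx:
            p *= probs[i]
        pairs.append((F(sum(values[i] for i in idx), n), p))
    return pairs

n = 4
mu1, mu2, mu3, mu4 = central_moments(values, probs)
vals, ps = zip(*law_of_mean(n))
m1, m2, m3, m4 = central_moments(vals, ps)

print(m2 == mu2 / n, m3 == mu3 / n**2)
print(m4 == mu4 / n**3 + 3 * (n - 1) * mu2**2 / n**3)
```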



SAMPLING MOMENTS OF THE MEAN: 
POPULATION FINITE 

While the concept of the infinite population presents no 
difficulties to the probabilist it is rare for the statistician to find 
a population which he could not count if he had sufficient time 
and patience. Also it is unusual for the statistician to be able to
sample with replacement. However, generally it is assumed, and
it is often the case, that the population is large enough, and the
sample small enough, for the sampling moments of the mean to
be used assuming the population is infinite. We shall derive the
sampling moments of the mean when the population is finite and
the drawings are made without replacement, and the student may
then judge for himself the degree of approximation involved.

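Assume that the population consists of N elements or units,
and that the characteristic u we are considering takes values
u_1, u_2, ..., u_N in the population. Let

    \mu'_1 = \frac{1}{N}\sum_{g=1}^{N} u_g,   \mu_2 = \frac{1}{N}\sum_{g=1}^{N} (u_g - \mu'_1)^2,

    \mu_3 = \frac{1}{N}\sum_{g=1}^{N} (u_g - \mu'_1)^3,   \mu_4 = \frac{1}{N}\sum_{g=1}^{N} (u_g - \mu'_1)^4.

Suppose that a sample of n units is drawn from the population
of N units, and associate with each unit of the sample, in order
of drawing, a random variable. We have therefore n random
variables x_1, x_2, ..., x_n but they are no longer independent as in
the previous case. We require to find

    \mu'_1(\bar{x}), \mu_2(\bar{x}), \mu_3(\bar{x}), \mu_4(\bar{x}).

In an example in the preceding chapter it was shown that

    \mu'_1(\bar{x}) = \mu'_1,   \mu_2(\bar{x}) = \frac{\mu_2}{n}\left(\frac{N-n}{N-1}\right).

The expansions for \mu_3(\bar{x}) and \mu_4(\bar{x}) will be the same as in the
infinite case but when we come to take expectations the terms
will no longer vanish because of the lack of independence. Thus
we have that

    \mu_3(\bar{x}) = \frac{1}{n^3}\left\{\sum_{i=1}^{n} \mathcal{E}(x_i - \mu'_1)^3
        + 3\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \mathcal{E}\left[(x_i - \mu'_1)^2(x_j - \mu'_1) + (x_i - \mu'_1)(x_j - \mu'_1)^2\right]
        + 6\sum_{i=1}^{n-2}\sum_{j=i+1}^{n-1}\sum_{l=j+1}^{n} \mathcal{E}\left[(x_i - \mu'_1)(x_j - \mu'_1)(x_l - \mu'_1)\right]\right\}.

We shall consider the evaluation of the middle term in detail,
and leave the reader to fill in the calculations for the last term.
It is required to find

    \mathcal{E}(x_i - \mu'_1)^2(x_j - \mu'_1).

x_i and x_j are random variables, and we may make an appeal to
the definition of an expectation and write

    \mathcal{E}(x_i - \mu'_1)^2(x_j - \mu'_1) = \frac{1}{N(N-1)}\sum_{g=1}^{N}\sum_{h \ne g} (u_g - \mu'_1)^2(u_h - \mu'_1).

The main difficulty of the problem rests in the evaluation of the
sum on the right-hand side. Consider

    \sum_{g=1}^{N} (u_g - \mu'_1)^2 \sum_{h=1}^{N} (u_h - \mu'_1).

This product is equal to zero, since the second sum is zero.
Hence if we expand the summations we have

    0 = \sum_{g=1}^{N} (u_g - \mu'_1)^3 + \sum_{g=1}^{N}\sum_{h \ne g} (u_g - \mu'_1)^2(u_h - \mu'_1)

and

    \mathcal{E}(x_i - \mu'_1)^2(x_j - \mu'_1) = -\frac{1}{N(N-1)}\sum_{g=1}^{N} (u_g - \mu'_1)^3 = -\frac{\mu_3}{N-1}.

It is clear then that if we consider

    \mathcal{E}(x_i - \mu'_1)^2(x_j - \mu'_1)

together with

    \mathcal{E}(x_i - \mu'_1)(x_j - \mu'_1)^2

we shall have that

    3\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \mathcal{E}\left[(x_i - \mu'_1)^2(x_j - \mu'_1) + (x_i - \mu'_1)(x_j - \mu'_1)^2\right] = -\frac{3n(n-1)}{N-1}\mu_3.

Again, by considering the product

    \sum_{g=1}^{N} (u_g - \mu'_1) \sum_{h=1}^{N} (u_h - \mu'_1) \sum_{v=1}^{N} (u_v - \mu'_1) = 0,

it may be shown that

    6\sum_{i=1}^{n-2}\sum_{j=i+1}^{n-1}\sum_{l=j+1}^{n} \mathcal{E}(x_i - \mu'_1)(x_j - \mu'_1)(x_l - \mu'_1) = \frac{2n(n-1)(n-2)}{(N-1)(N-2)}\mu_3,

from which it follows that

    \mu_3(\bar{x}) = \frac{\mu_3}{n^2}\left[1 - \frac{3(n-1)}{N-1} + \frac{2(n-1)(n-2)}{(N-1)(N-2)}\right].

This simplifies to

    \mu_3(\bar{x}) = \frac{\mu_3}{n^2} \cdot \frac{(N-n)(N-2n)}{(N-1)(N-2)}.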

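The calculations for \mu_4(\bar{x}) may be carried out along similar lines.
Writing down the expectations it is clear that in order to
evaluate the sums it will be necessary to consider the expansions
of

    \sum_{g} (u_g - \mu'_1)^3 \sum_{h} (u_h - \mu'_1),   \sum_{g} (u_g - \mu'_1)^2 \sum_{h} (u_h - \mu'_1)^2,

    \sum_{g} (u_g - \mu'_1)^2 \sum_{h} (u_h - \mu'_1) \sum_{v} (u_v - \mu'_1),

    \sum_{g} (u_g - \mu'_1) \sum_{h} (u_h - \mu'_1) \sum_{v} (u_v - \mu'_1) \sum_{w} (u_w - \mu'_1).

The reader should go through the algebra involved in order to
get a facility in expansion and enumeration of the products of
sums of variables. This algebra is quite straightforward and it is
easy to show that, the indices i, j, l, m being all different,

    4\sum \mathcal{E}(x_i - \mu'_1)^3(x_j - \mu'_1) = -\frac{4n(n-1)}{N-1}\mu_4,

    6\sum \mathcal{E}(x_i - \mu'_1)^2(x_j - \mu'_1)^2 = \frac{3n(n-1)}{N-1}\left(N\mu_2^2 - \mu_4\right),

    12\sum \mathcal{E}(x_i - \mu'_1)^2(x_j - \mu'_1)(x_l - \mu'_1) = \frac{6n(n-1)(n-2)}{(N-1)(N-2)}\left(2\mu_4 - N\mu_2^2\right),

    24\sum \mathcal{E}(x_i - \mu'_1)(x_j - \mu'_1)(x_l - \mu'_1)(x_m - \mu'_1) = \frac{n(n-1)(n-2)(n-3)}{(N-1)(N-2)(N-3)}\left(3N\mu_2^2 - 6\mu_4\right),

whence, collecting terms and rearranging we have finally that

    \mu_4(\bar{x}) = \frac{1}{n^3}\left\{\mu_4\left[1 - \frac{7(n-1)}{N-1} + \frac{12(n-1)(n-2)}{(N-1)(N-2)} - \frac{6(n-1)(n-2)(n-3)}{(N-1)(N-2)(N-3)}\right]
        + 3N\mu_2^2\left[\frac{n-1}{N-1} - \frac{2(n-1)(n-2)}{(N-1)(N-2)} + \frac{(n-1)(n-2)(n-3)}{(N-1)(N-2)(N-3)}\right]\right\}.

If N is very large compared with n it is clear that

    \mu_3(\bar{x}) \approx \frac{\mu_3}{n^2},   \mu_4(\bar{x}) \approx \frac{\mu_4}{n^3} + \frac{3(n-1)}{n^3}\mu_2^2,

i.e. \mu_3(\bar{x}) and \mu_4(\bar{x}) are very nearly the same as those obtained for
the case when the population is infinite.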
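
The third and fourth sampling moments of the mean for a finite population may be checked by listing every sample that can be drawn without replacement. The sketch below assumes a hypothetical population of seven values and n = 4, and compares the enumeration with the closed forms μ_3(x̄) = (μ_3/n²)(N − n)(N − 2n)/((N − 1)(N − 2)) and the corresponding expression for μ_4(x̄) written in terms of the ratios r1 = (n − 1)/(N − 1), r2 = r1(n − 2)/(N − 2), r3 = r2(n − 3)/(N − 3). These closed forms are the ones obtained by carrying through the algebra described above; they require N to exceed 3.

```python
from fractions import Fraction as F
from itertools import combinations

u = [1, 2, 4, 8, 16, 32, 64]    # hypothetical finite population
N, n = len(u), 4

m = F(sum(u), N)
mu2, mu3, mu4 = (sum((x - m) ** r for x in u) / N for r in (2, 3, 4))

# Every sample of n without replacement is equally likely.
means = [F(sum(s), n) for s in combinations(u, n)]
exact3, exact4 = (sum((x - m) ** r for x in means) / len(means) for r in (3, 4))

r1 = F(n - 1, N - 1)
r2 = r1 * F(n - 2, N - 2)
r3 = r2 * F(n - 3, N - 3)
formula3 = mu3 / n**2 * F((N - n) * (N - 2 * n), (N - 1) * (N - 2))
formula4 = (mu4 * (1 - 7 * r1 + 12 * r2 - 6 * r3)
            + 3 * N * mu2**2 * (r1 - 2 * r2 + r3)) / n**3

print(exact3 == formula3, exact4 == formula4)
```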

We shall now go on to a discussion of the first two sampling
moments of the (standard deviation)^2, i.e. we shall find

    \mathcal{E}(s^2)

and

    \sigma_{s^2}^2 = \mathcal{E}\left(s^2 - \mathcal{E}(s^2)\right)^2.

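FIRST TWO SAMPLING MOMENTS OF THE
(STANDARD DEVIATION)^2: POPULATION INFINITE

As before it will be assumed that we consider a sample of n
units randomly and independently drawn from some population
of which it is known that the first four moments, \mu'_1, \mu_2, \mu_3, \mu_4,
exist. With each element of the sample we associate a random
variable x_i (i = 1, 2, ..., n).

In the previous chapter it was shown that

    \mathcal{E}(s^2) = \frac{n-1}{n}\mu_2,

that is to say, the mean value in repeated sampling of the sample
(standard deviation)^2 is not equal to the population (standard
deviation)^2. We shall refer to this fact later. The process of
obtaining the second moment of s^2 is perhaps a little difficult as
regards enumeration but the method is the same as those used
previously both in this and the last chapter. We shall expand \sigma_{s^2}^2,

    \sigma_{s^2}^2 = \mathcal{E}(s^4) - \left(\mathcal{E}(s^2)\right)^2,

and, writing n s^2 = \sum_{i=1}^{n} (x_i - \mu'_1)^2 - n(\bar{x} - \mu'_1)^2, carry out the
enumeration of \mathcal{E}(s^4) in three stages.

    (i) \mathcal{E}\left[\sum_{i=1}^{n} (x_i - \mu'_1)^2\right]^2 = \mathcal{E}\left[\sum_{i=1}^{n} (x_i - \mu'_1)^4 + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} (x_i - \mu'_1)^2(x_j - \mu'_1)^2\right]
            = n\mu_4 + n(n-1)\mu_2^2.

    (ii) \mathcal{E}\left(n(\bar{x} - \mu'_1)^2\right)^2 = \frac{1}{n^2}\mathcal{E}\left[\sum_{i=1}^{n} (x_i - \mu'_1)\right]^4.

Because of the independence of the x's we have then

    \mathcal{E}\left(n(\bar{x} - \mu'_1)^2\right)^2 = \frac{1}{n^2}\left[n\mu_4 + 3n(n-1)\mu_2^2\right] = \frac{1}{n}\left[\mu_4 + 3(n-1)\mu_2^2\right].

    (iii) \mathcal{E}\left[2n(\bar{x} - \mu'_1)^2\sum_{i=1}^{n} (x_i - \mu'_1)^2\right] = \frac{2}{n}\mathcal{E}\left\{\left[\sum_{i=1}^{n} (x_i - \mu'_1)\right]^2\left[\sum_{i=1}^{n} (x_i - \mu'_1)^2\right]\right\},

whence

    \mathcal{E}\left[2n(\bar{x} - \mu'_1)^2\sum_{i=1}^{n} (x_i - \mu'_1)^2\right] = 2\left[\mu_4 + (n-1)\mu_2^2\right].

Substituting for these expressions in the expression for \mathcal{E}(s^4) we
have

    \mathcal{E}(s^4) = \frac{(n-1)^2}{n^3}\mu_4 + \frac{(n-1)(n^2 - 2n + 3)}{n^3}\mu_2^2,

which gives us for \sigma_{s^2}^2, remembering \beta_2 = \mu_4/\mu_2^2,

    \sigma_{s^2}^2 = \frac{(n-1)\mu_2^2}{n^3}\left[(n-1)\beta_2 - (n-3)\right].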
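
The (standard error)² of s² for independent drawings, which in terms of the population moments is ((n − 1)/n³)[(n − 1)μ_4 − (n − 3)μ_2²], can be checked by exact enumeration. The sketch below is an illustration only: the three-point population law and the choice n = 4 are assumptions, and s² is taken with divisor n as in the text.

```python
from fractions import Fraction as F
from itertools import product

values = [0, 1, 4]                        # hypothetical population law
probs = [F(1, 2), F(1, 4), F(1, 4)]
n = 4

m = sum(v * p for v, p in zip(values, probs))
mu2 = sum((v - m) ** 2 * p for v, p in zip(values, probs))
mu4 = sum((v - m) ** 4 * p for v, p in zip(values, probs))

def s2(sample):
    """Sample (standard deviation)^2 with divisor n."""
    xbar = F(sum(sample), n)
    return sum((x - xbar) ** 2 for x in sample) / n

e_s2 = e_s4 = F(0)
for idx in product(range(len(values)), repeat=n):   # every ordered sample
    p = F(1)
    for i in idx:
        p *= probs[i]
    v = s2([values[i] for i in idx])
    e_s2 += p * v
    e_s4 += p * v * v

var_s2 = e_s4 - e_s2 ** 2
formula = F(n - 1, n**3) * ((n - 1) * mu4 - (n - 3) * mu2**2)

print(e_s2 == F(n - 1, n) * mu2, var_s2 == formula)
```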



ESTIMATE OF POPULATION (STANDARD DEVIATION)^2

In the previous chapter it was noted that the sample standard
deviation can only be used in describing something about the
sample and that as soon as it is necessary to estimate the standard
deviation in the population it is desirable to consider not s^2, but
a quantity s_e^2 (say), where

    s_e^2 = \frac{n}{n-1}s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2,

because the mean value in repeated sampling of s_e^2 is equal to \sigma^2.
The (standard error)^2 of s_e^2 will be, from the immediately preceding
analysis,

    \sigma_{s_e^2}^2 = \frac{n^2}{(n-1)^2}\sigma_{s^2}^2 = \frac{\mu_2^2}{n}\left[\beta_2 - \frac{n-3}{n-1}\right].

It will be noted for both s^2 and s_e^2 that

(1) when n becomes very large

    \sigma_{s^2}^2 and \sigma_{s_e^2}^2 both \to \frac{\mu_2^2}{n}\left[\beta_2 - 1\right];

(2) when the parent population is normal and therefore
\beta_2 = 3 we have that

    \sigma_{s^2}^2 = \frac{2\mu_2^2}{n}\left[\frac{n-1}{n}\right],   \sigma_{s_e^2}^2 = \frac{2\mu_2^2}{n}\left[\frac{n}{n-1}\right],

and when in addition n is large

    \sigma_{s^2}^2 and \sigma_{s_e^2}^2 both \to \frac{2\mu_2^2}{n}.

This last expression is often useful when carrying out a rough
test of significance in one's head. The expressions for \beta_1(s_e^2) and
\beta_2(s_e^2) may be calculated in the same way as we have calculated
\sigma_{s^2}^2, and the student should try to work these out. It may be
shown, for the parent population normally distributed and for n,
the size of sample, large, that

    \beta_1(s_e^2) \approx \frac{8}{n},   \beta_2(s_e^2) \approx 3 + \frac{12}{n}.

FIRST Two SAMPLING MOMENTS OF s e : 
POPULATION INFINITE 

For the student who is not yet accustomed to the ideas of 
analysis of variance it may seem a little artificial that we have 
treated s 2 and 6* 2 in some detail, but have made no mention of s 
and s c which, being measures of scale, are collective characters 
which are used in the development of statistical theory from the 
very beginning. The reason why we do not treat of s and s e is 
not far to seek; the square root sign makes them difficult to 
handle mathematically. We shall find here the first two moments 
of the sampling distribution of s e but we shall do so in a very 
approximate way and rely for our justification on the fact that 
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the expressions obtained do agree well with numerical sampling 
experiments. By definition 



Write 



and assume $\delta s_e$ is small by comparison with $\sigma$, which will be so
if $n$, the sample size, is large.

Expand the right-hand side as a series, and take expectations. 



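The expectations within the bracket may be evaluated directly.

$$\mathcal{E}(s_e^2) = \sigma^2$$

from previous work, while from above

$$\mathcal{E}(s_e^2) = \sigma^2 + \mathcal{E}(s_e^2 - \sigma^2),$$

from which it follows that

$$\mathcal{E}(s_e^2 - \sigma^2) = 0.$$

Also

$$\mathcal{E}(s_e^2 - \sigma^2)^2 = \sigma^2_{s_e^2} = \frac{\sigma^4}{n}(\beta_2 - 1).$$

This is certainly only true for large $n$ but if $n$ is not large then the
original assumption will not hold good either. Substituting for
these expressions we have

$$\mathcal{E}(s_e) = \sigma\left[1 - \frac{\beta_2 - 1}{8n}\right]$$

and hence

$$\sigma_{s_e} = \sigma\sqrt{\frac{\beta_2 - 1}{4n}}.$$

When the parent population is normally distributed then $\beta_2 = 3$
and

$$\mathcal{E}(s_e) = \sigma\left[1 - \frac{1}{4n}\right], \qquad \sigma_{s_e} = \frac{\sigma}{\sqrt{2n}}.$$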
The student must beware of an indiscriminate use of these
sampling moments of $s_e$. Nevertheless, in spite of their mathematical
limitations, they are useful if only because the distribution
of $s_e$ tends, with $n$ increasing, to be normal more quickly than
does the distribution of $s_e^2$.
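
The warning may be made concrete for a normal parent, where the mean of $s_e$ is known exactly in terms of gamma functions. The sketch below is an illustration only: it assumes the large-sample expressions intended here are $\sigma(1 - 1/4n)$ for the mean and $\sigma/\sqrt{2n}$ for the standard error of $s_e$, and the function names are hypothetical.

```python
import math

def exact_mean_se(n, sigma=1.0):
    """Exact mean of s_e for a normal parent:
    sigma * sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2)."""
    log_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    return sigma * math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)

def exact_sd_se(n, sigma=1.0):
    # Var(s_e) = E(s_e^2) - [E(s_e)]^2, and E(s_e^2) = sigma^2.
    return math.sqrt(sigma ** 2 - exact_mean_se(n, sigma) ** 2)

def approx_mean_se(n, sigma=1.0):
    # Assumed large-sample approximation for a normal parent.
    return sigma * (1 - 1 / (4 * n))

def approx_sd_se(n, sigma=1.0):
    # Assumed large-sample approximation for a normal parent.
    return sigma / math.sqrt(2 * n)

for n in (5, 10, 50, 200):
    print(n,
          round(exact_mean_se(n), 4), round(approx_mean_se(n), 4),
          round(exact_sd_se(n), 4), round(approx_sd_se(n), 4))
```

For small $n$ the two columns differ in the third decimal place or earlier; for $n$ of a few hundred the agreement is close, which is the sense in which the approximate moments are to be used with care.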

It is known that if the original population is normally distributed
then $(n-1)s_e^2/\sigma^2$ is distributed as $\chi^2$ with $n-1$ degrees of freedom.
We shall discuss the $\chi^2$ distribution as an example in the theory
of characteristic functions but it is useful here as an exercise to
find the first two sampling moments of $\chi^2$, which we shall do in
a quite general way, for the case of a grouped frequency distribution.
Some variation of our previous technique will be necessary,
and we shall now make use of the concept of the characteristic
random variable. This concept will be found useful in all sampling
problems where it is necessary to consider groups or strata and
the student should try to make himself familiar with its use. We
shall assume that a sample of $N$ observations is randomly and
independently drawn from a population which may be classified
into $k$ groups. Suppose that the chance of an individual being
drawn from the $i$th group is $p_i$ for $i = 1, 2, \ldots, k$, that
$\sum_{i=1}^{k} p_i = 1$ and that the number in the sample actually drawn from the $i$th
group is $n_i$. It is obvious that

$$\sum_{i=1}^{k} n_i = N.$$

Write $\delta n_i = n_i - Np_i$.

We shall begin by finding $\sigma^2(\delta n_i)$ and $\rho(\delta n_i\,\delta n_j)$, for

$$i, j = 1, 2, \ldots, k \quad (i \neq j),$$

and for convenience we shall consider only two groups, the $i$th
and the $j$th, throughout. The argument will obviously hold good
for any pair. Associate with each unit of the sample of $N$, in
order of drawing, two series of independent characteristic
random variables, $\alpha_l$ and $\beta_t$, for $l, t = 1, 2, \ldots, N$. The series of
variables $\alpha_l$ will have the property that they will take the value 1
when the sample unit falls into the $i$th group but that they will
be zero otherwise. Similarly the variables $\beta_t$ will take the value 1
when the sample unit falls in the $j$th group and be zero otherwise.
It is clear from this definition that

$$n_i = \sum_{l=1}^{N} \alpha_l, \qquad n_j = \sum_{t=1}^{N} \beta_t.$$



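Using only the variables $\alpha_l$ we may find $\sigma^2(\delta n_i)$.
From definition

$$\sigma^2(\delta n_i) = \mathcal{E}(n_i - Np_i)^2 = \mathcal{E}(n_i^2) - N^2p_i^2.$$

Again from definition

$$\mathcal{E}(n_i^2) = \mathcal{E}\left(\sum_{l=1}^{N}\alpha_l\right)^2 = \mathcal{E}\left[\sum_{l=1}^{N}\alpha_l^2 + 2\sum_{l=1}^{N-1}\sum_{h=l+1}^{N}\alpha_l\alpha_h\right].$$

We assumed that the characteristic random variables were
independent, whence the right-hand term will be the product
of two expectations.

$$\mathcal{E}(\alpha_l^2) = p_i, \qquad \mathcal{E}(\alpha_l\alpha_h) = \mathcal{E}(\alpha_l)\,\mathcal{E}(\alpha_h) = p_i^2,$$

and therefore $\mathcal{E}(n_i^2) = Np_i + N(N-1)p_i^2$

and $\sigma^2(\delta n_i) = Np_i(1 - p_i)$.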



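Similarly for the correlation coefficient $\rho(\delta n_i\,\delta n_j)$. From the
definition of the correlation coefficient we have

$$\rho(\delta n_i\,\delta n_j) = \frac{\mathcal{E}(\delta n_i\,\delta n_j)}{\sigma(\delta n_i)\,\sigma(\delta n_j)},$$

since the separate expectations are each zero. The denominator
is already known and it remains to calculate

$$\mathcal{E}(\delta n_i\,\delta n_j) = \mathcal{E}(n_i n_j) - N^2p_ip_j.$$

Using now the two series of variables

$$\mathcal{E}(n_i n_j) = \mathcal{E}\left[\sum_{l=1}^{N}\alpha_l\beta_l + \sum_{l \neq t}\alpha_l\beta_t\right].$$

From definition it is certain that the first summation must be
zero, i.e. $\alpha_l\beta_l = 0$,
for when $\alpha_l = 1$, $\beta_l = 0$ and vice versa. Hence

$$\mathcal{E}(n_i n_j) = N(N-1)p_ip_j$$

and

$$\mathcal{E}(\delta n_i\,\delta n_j) = -Np_ip_j.$$

Substituting in the expression for $\rho(\delta n_i\,\delta n_j)$ it is seen that

$$\rho(\delta n_i\,\delta n_j) = -\sqrt{\frac{p_ip_j}{(1-p_i)(1-p_j)}}.$$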
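
The student who wishes to check results of this kind without algebra may enumerate every possible sample for a small $N$. The following sketch does so with exact fractions; the probabilities, the value of $N$ and the function name are chosen purely for illustration, and the quantities compared are the standard multinomial values $Np_i(1-p_i)$ for the variance of $n_i$ and $-Np_ip_j$ for the product moment of $\delta n_i$ and $\delta n_j$.

```python
from fractions import Fraction
from itertools import product

def moments_by_enumeration(p, N, i, j):
    """Enumerate all k**N ordered samples and return
    (Var(n_i), Var(n_j), Cov(n_i, n_j)) as exact fractions."""
    k = len(p)
    e_i = e_j = e_ii = e_jj = e_ij = Fraction(0)
    for groups in product(range(k), repeat=N):
        prob = Fraction(1)
        for g in groups:
            prob *= p[g]
        # n_i is the sum of the characteristic variables alpha_l,
        # n_j the sum of the characteristic variables beta_t.
        n_i = sum(1 for g in groups if g == i)
        n_j = sum(1 for g in groups if g == j)
        e_i += prob * n_i
        e_j += prob * n_j
        e_ii += prob * n_i * n_i
        e_jj += prob * n_j * n_j
        e_ij += prob * n_i * n_j
    return e_ii - e_i ** 2, e_jj - e_j ** 2, e_ij - e_i * e_j

# Hypothetical illustration: three groups, sample of four.
p = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
N = 4
var_i, var_j, cov = moments_by_enumeration(p, N, 0, 1)
print(var_i, var_j, cov)
```

Because the arithmetic is exact, agreement with the formulae is an identity and not merely a numerical coincidence.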



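These preliminary calculations will serve to make the reader
familiar with the way in which the concept of the characteristic
random variable is used. $\chi^2$ may be defined as

$$\chi^2 = \sum_{i=1}^{k}\frac{(n_i - Np_i)^2}{Np_i} = \sum_{i=1}^{k}\frac{(\delta n_i)^2}{Np_i}$$

and we must therefore consider $\mathcal{E}(\delta n_i^2)$, $\mathcal{E}(\delta n_i^2\,\delta n_j^2)$ and $\mathcal{E}(\delta n_i^4)$ if
we are to find the first two sampling moments of this criterion.
We have already shown that

$$\mathcal{E}(\delta n_i^2) = Np_i(1-p_i),$$

from which it follows that

$$\mathcal{E}(\chi^2) = \sum_{i=1}^{k}(1 - p_i) = k - 1,$$

but the expectations of the higher moments of $\delta n_i$ and $\delta n_j$
may cause difficulty.

$$\mathcal{E}(\delta n_i^4) = \mathcal{E}\left[\sum_{l=1}^{N}(\alpha_l - p_i)\right]^4 = \sum_{l=1}^{N}\mathcal{E}(\alpha_l - p_i)^4 + 6\sum_{l=1}^{N-1}\sum_{h=l+1}^{N}\mathcal{E}(\alpha_l - p_i)^2\,\mathcal{E}(\alpha_h - p_i)^2.$$

The other cross products vanish because of the independence of
the variables. The $\mathcal{E}(\alpha_l - p_i)^4$ comes immediately on expansion,

$$\mathcal{E}(\alpha_l - p_i)^4 = p_i(1-p_i)[1 - 3p_i(1-p_i)],$$

and we obtain similarly $\mathcal{E}(\alpha_l - p_i)^2 = p_i(1-p_i)$. Substitution of these values
in the expression for $\mathcal{E}(\delta n_i^4)$ gives

$$\mathcal{E}(\delta n_i^4) = Np_i(1-p_i)[1 + 3(N-2)p_i(1-p_i)].$$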
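The $\mathcal{E}(\delta n_i^2\,\delta n_j^2)$ can be calculated in a similar way, but it will be
necessary to take care in the enumeration of terms. For the
student who is not sure of himself when dealing with summation
signs it is perhaps better to expand

$$\mathcal{E}(\delta n_i^2\,\delta n_j^2) = \mathcal{E}[(n_i - Np_i)^2 (n_j - Np_j)^2]$$

and consider each term separately.
For example,

$$\mathcal{E}(n_i^2 n_j^2) = \mathcal{E}\left[\left(\sum_{l=1}^{N}\alpha_l\right)^2\left(\sum_{t=1}^{N}\beta_t\right)^2\right],$$

in which, since $\alpha_l\beta_l = 0$, only products of $\alpha$'s and $\beta$'s attached to
different sample units survive: pairs $\alpha_l\beta_t$, triplets such as
$\alpha_l\alpha_h\beta_t$ and $\alpha_l\beta_t\beta_u$, and quadruplets $\alpha_l\alpha_h\beta_t\beta_u$ (6 terms for each
set of four units),

whence

$$\mathcal{E}(n_i^2 n_j^2) = N(N-1)p_ip_j + N(N-1)(N-2)p_ip_j(p_i+p_j) + N(N-1)(N-2)(N-3)p_i^2p_j^2.$$

Similarly

$$\mathcal{E}(n_i^2 n_j) = N(N-1)p_ip_j + N(N-1)(N-2)p_i^2p_j,$$

with the corresponding expression for $\mathcal{E}(n_i n_j^2)$.

We have already evaluated $\mathcal{E}(n_i^2)$ and $\mathcal{E}(n_j^2)$, so that on substitution
we have that

$$\mathcal{E}(\delta n_i^2\,\delta n_j^2) = Np_ip_j\left[p_i + p_j - 3p_ip_j + (N-1)\{(1-p_i)(1-p_j) + 2p_ip_j\}\right].$$

From the definition of a (standard error)$^2$ it follows that

$$\sigma^2_{\chi^2} = \mathcal{E}[(\chi^2)^2] - (k-1)^2.$$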
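Substitute for $\chi^2$:

$$\sigma^2_{\chi^2} = \mathcal{E}\left[\sum_{i=1}^{k}\frac{\delta n_i^4}{N^2p_i^2} + 2\sum_{i=1}^{k-1}\sum_{j=i+1}^{k}\frac{\delta n_i^2\,\delta n_j^2}{N^2p_ip_j}\right] - (k-1)^2.$$

The expectations of both these terms have been evaluated and
it only remains therefore to substitute the values obtained and
to show that

$$\sigma^2_{\chi^2} = 2(k-1) + \frac{1}{N}\left[\sum_{i=1}^{k}\frac{1}{p_i} - k^2 - 2k + 2\right].$$

The algebra involved is not heavy but the student may perhaps
get into difficulties if he does not resort to the by now familiar
trick of getting rid of the double summation sign, e.g.

$$\sum_{i=1}^{k-1}\sum_{j=i+1}^{k}(p_i + p_j) = (k-1)\sum_{i=1}^{k}p_i = k - 1$$

and

$$\left[\sum_{i=1}^{k}p_i\right]^2 = \sum_{i=1}^{k}p_i^2 + 2\sum_{i=1}^{k-1}\sum_{j=i+1}^{k}p_ip_j,$$

remembering that $\sum_{i=1}^{k}p_i = 1$.

The usual values taken for the moments of $\chi^2$ are

$$\mathcal{E}(\chi^2) = k - 1, \qquad \sigma^2_{\chi^2} = 2(k-1).$$

It will be noted that by taking these values we are neglecting
terms of order $1/N$ in the expression for $\sigma^2_{\chi^2}$. Generally this
approximation is not important and the student may convince
himself that this is the case by working out some numerical
examples.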

Numerical examples

Compare the true value and the approximate values of $\sigma^2_{\chi^2}$ for
the cases:

(i) $k = 10$, $N = 20$, $p_i = \tfrac{1}{10}$ for $i = 1, 2, \ldots, 10$.
(ii) $k = 10$, $N = 50$, $p_1 = p_{10} = 0.02$, $p_2 = p_9 = 0.04$,
$p_3 = p_8 = 0.08$, $p_4 = p_7 = 0.14$, $p_5 = p_6 = 0.22$.
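
A sketch of how such a comparison might be carried out is given below. It assumes that the "true value" is the standard exact expression $2(k-1) + \frac{1}{N}\left[\sum 1/p_i - k^2 - 2k + 2\right]$ for the (standard error)$^2$ of $\chi^2$ in multinomial sampling, and that the approximate value is $2(k-1)$; the function names are hypothetical.

```python
from fractions import Fraction

def chi2_variance_exact(p, N):
    """Exact variance of chi-square for k groups with probabilities p
    and sample size N (assumed closed form, see the lead-in)."""
    k = len(p)
    correction = sum(1 / pi for pi in p) - k * k - 2 * k + 2
    return 2 * (k - 1) + correction / Fraction(N)

def chi2_variance_usual(p):
    """The usual approximation, neglecting terms of order 1/N."""
    return 2 * (len(p) - 1)

# Case (i): ten equally likely groups, N = 20.
case_i = [Fraction(1, 10)] * 10

# Case (ii): symmetrical unequal probabilities, N = 50.
half = [Fraction(2, 100), Fraction(4, 100), Fraction(8, 100),
        Fraction(14, 100), Fraction(22, 100)]
case_ii = half + half[::-1]

for label, p, N in (("(i)", case_i, 20), ("(ii)", case_ii, 50)):
    exact = chi2_variance_exact(p, N)
    print(label, float(exact), chi2_variance_usual(p))
```

It may be remarked that when the $p_i$ are all equal the correction term is negative, whereas small group probabilities make $\sum 1/p_i$ large and the correction positive.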



REFERENCES AND READING

There are many papers both in the journal Biometrika and elsewhere
where the sampling moments of different distributions are derived. We
may mention A. E. R. Church, Biometrika, xvii, p. 79; J. M. le Roux,
Biometrika, xxiii, p. 134; J. Neyman, Biometrika, xvii, p. 472; J. B. S.
Haldane, Biometrika, xxxiii, p. 234, but the list is not by any means
exhaustive. The student wishing further exercises on expectations should
consult these and other papers and work through the algebra.

For examples in the use of the characteristic random variable the
student might consult J. Neyman, J. Amer. Statist. Assoc. xxxiii, p. 101,
'Contribution to the theory of sampling human populations', and the
appendix to F. N. David, Statist. Res. Mem. ii, p. 69, 'Limiting
distributions connected with certain methods of sampling human
populations'.



CHAPTER XII 

RANDOM VARIABLES. INEQUALITIES. 
LAWS OF LARGE NUMBERS. LEXIS THEORY 

By application of the elementary theorems regarding the 
addition and multiplication of expectations most problems can 
be solved. It will have been noted that the theorems are quite 
general and do not depend for their application on the random 
variable following a particular probability law. Following along 
the same lines, and without specifying anything about x, other 
than that it is a discontinuous or continuous random variable, 
several inequalities have been devised which enable limits to be set
for the probability of $x$ being less than a given value. Most of these
inequalities spring from, or are generated from, Markoff's lemma.

MARKOFF'S LEMMA

It is assumed that a certain random variable $x$ may take only
positive or zero values. If $a = \mathcal{E}(x)$ and $t$ is any given number
greater than unity then

$$P\{x \ge at^2\} \le \frac{1}{t^2}.$$

Let $x$ take values in ascending order

$$0 \le u_1 < u_2 < u_3 < \ldots < u_n < u_{n+1} < \ldots < u_N$$

and let $p_1, p_2, \ldots, p_n, p_{n+1}, \ldots, p_N$ be the corresponding
probabilities that $x$ takes the given values. The proof may conveniently
be divided into three parts.

I. $at^2 < u_1$

From the definition of expectation

$$\mathcal{E}(x) = a = \sum_{i=1}^{N}u_ip_i \ge u_1\sum_{i=1}^{N}p_i = u_1.$$

Since $t$ is greater than unity, if $a \ge u_1$, then $at^2$
a fortiori $\ge u_1$, and cannot be less than $u_1$. This case therefore cannot arise.

II. $at^2 > u_N$

If $at^2 > u_N$ then $P\{x \ge at^2\} = 0$, which is certainly less than $1/t^2$.

III. $u_1 \le at^2 \le u_N$

If $at^2$ lies between $u_1$ and $u_N$ then it must be possible to find
two values of $x$, say $u_n$ and $u_{n+1}$, between which $at^2$ will lie.
Assume therefore that

$$u_n < at^2 \le u_{n+1}.$$

Writing down the expectation of $x$ we have

$$\mathcal{E}(x) = a = u_1p_1 + u_2p_2 + \ldots + u_np_n + u_{n+1}p_{n+1} + \ldots + u_Np_N,$$

$$a \ge u_{n+1}p_{n+1} + \ldots + u_Np_N \ge at^2(p_{n+1} + \ldots + p_N) = at^2\,P\{x \ge at^2\}.$$

It follows that $P\{x \ge at^2\} \le 1/t^2$.

ILLUSTRATION. $x$ is a binomial variable with expectation
equal to $np$.

Let $t = \sqrt{3}$. Then $P\{x \ge 3np\} \le \tfrac{1}{3}$.

It will be seen from this that the limit set to the probability by
the inequality is not very restrictive, but it must be remembered
that the lemma will apply to any random variable about which
the only thing known is its mean value. If the probability law of
a variable is known then there is no need to calculate the probability
as given by Markoff's lemma because the exact value for
any required probability can be found.
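
How far the limit may be from the exact value can be seen by direct summation of the binomial terms. In the sketch below the values $n = 30$ and $p = 1/5$ are hypothetical, chosen only for illustration, and the function name is an assumption; exact fractions are used so that the comparison with $1/t^2 = 1/3$ involves no rounding.

```python
from fractions import Fraction
from math import comb

def binomial_upper_tail(n, p, threshold):
    """P{x >= threshold} for a binomial variable, by direct summation."""
    q = 1 - p
    return sum(comb(n, r) * p ** r * q ** (n - r)
               for r in range(n + 1) if r >= threshold)

n = 30                # hypothetical number of trials
p = Fraction(1, 5)    # hypothetical probability of success
t_squared = 3         # t = sqrt(3), as in the illustration

tail = binomial_upper_tail(n, p, t_squared * n * p)   # P{x >= 3np}
limit = Fraction(1, t_squared)
print(float(tail), float(limit))
```

The exact probability is many orders of magnitude smaller than the limit of one-third, which is the price paid for assuming nothing about the variable beyond its mean.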

Finer limits can be obtained by the use of the Bienaymé-Tchebycheff
inequality which makes use of both the mean value
and the standard error but again since this inequality will be
applicable to any random variable which has a mean value and
a standard error too much cannot be expected of it.

BIENAYMÉ-TCHEBYCHEFF INEQUALITY

If $x$ is a random variable of any distribution whatever, then
provided $\mathcal{E}(x) = a$ and $\mathcal{E}(x - a)^2 = \sigma^2$ exist and $t > 1$

$$P\{|x - a| < t\sigma\} \ge 1 - \frac{1}{t^2}.$$

The proof of this inequality follows directly from the Markoff
lemma.

Write $y = (x - a)^2$.

Then $\mathcal{E}(y) = \mathcal{E}(x - a)^2 = \sigma^2$.
Applying Markoff's lemma it will be seen that

$$P\{y \ge t^2\sigma^2\} \le 1/t^2 \quad \text{or} \quad P\{(x - a)^2 \ge t^2\sigma^2\} \le 1/t^2$$

and therefore $1 - P\{|x - a| \ge t\sigma\} \ge 1 - 1/t^2$.
It is obvious that

$$P\{|x - a| \ge t\sigma\} + P\{|x - a| < t\sigma\} = 1$$

and the inequality

$$P\{|x - a| < t\sigma\} \ge 1 - \frac{1}{t^2}$$

is proved.

ILLUSTRATION. For illustration let us again consider the
binomial variable, $x$, and further, let $t = 3$.

$$\mathcal{E}(x) = np, \qquad \mathcal{E}(x - \mathcal{E}(x))^2 = npq$$

and $P\{|x - np| < 3\sqrt{npq}\} \ge \tfrac{8}{9}$.
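
The same kind of check may be made for this inequality. The sketch below sums the binomial terms lying within $t\sqrt{npq}$ of $np$ and sets the result beside $1 - 1/t^2$; the values of $n$ and $p$ and the function name are hypothetical and serve only as an illustration.

```python
from math import comb, sqrt

def prob_within(n, p, t):
    """P{|x - np| < t * sqrt(npq)} for a binomial variable, by summation."""
    q = 1 - p
    half_width = t * sqrt(n * p * q)
    return sum(comb(n, r) * p ** r * q ** (n - r)
               for r in range(n + 1) if abs(r - n * p) < half_width)

n, p, t = 30, 0.2, 3      # hypothetical n and p; t = 3 as in the illustration
exact = prob_within(n, p, t)
bound = 1 - 1 / t ** 2    # the Bienayme-Tchebycheff lower limit, 8/9
print(exact, bound)
```

The limit of 8/9 is again well below the exact probability, although it is a good deal closer than the limit supplied by the lemma alone.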

The Bienaymé-Tchebycheff inequality leads naturally to the
mathematical Law of Large Numbers. This last is a name given
to a series of theorems which although differing in their proofs do
not differ radically in their conclusions.

It may be shown for a broad class of linear functions,

$$y = F(x_1, x_2, \ldots, x_n),$$

that the standard error of $y$ tends to zero as $n$, the number of
variables $x$, increases without limit. The simplest case of this will
be when the $x$'s are all independent, when the standard error of
each $x$ has the same value, $\sigma$, and when $y = \bar{x}$. Then the standard
error of $y$ is $\sigma/\sqrt{n}$ and it is seen that this tends to zero as $n$ tends
to infinity. But the standard errors of each $x$ need not necessarily
be equal. The same result can be reached by supposing simply
that all the $x$'s are bounded. That is to say that there exists a
certain number $m$, such that $|x|$ cannot exceed $m$, and therefore
the standard error of $y$ cannot exceed $m/\sqrt{n}$.

Also the standard error of a linear function may tend to zero
even if the $x$'s are not all independent. This will be the case when
each random variable $x$ is correlated with the one which immediately
precedes it and the one which immediately follows it,
the others being independent or at least uncorrelated. Such
successions of $x$'s were considered by Markoff and called chains.
If the standard error of any function $y$, of the $n$ random
variables $x$, tends to zero as $n$ tends to $\infty$, then the Law of Large
Numbers will apply to the function, $y$.

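THEOREM. (Law of Large Numbers.) If $y$ is a function of $n$
random variables, $x_1, x_2, \ldots, x_n$, and if

$$\mathcal{E}(y) = a \quad \text{and} \quad \mathcal{E}(y - a)^2 = \sigma^2,$$

then provided $\sigma^2 \to 0$ as $n \to \infty$, for any two positive numbers
$\epsilon$ and $\eta$, where $\epsilon$ and $\eta$ may be as small as desired, it is possible to
find a number $n_0$, such that for $n > n_0$

$$P\{|y - a| < \epsilon\} \ge 1 - \eta.$$

The inequality of Bienaymé-Tchebycheff applied to $y$ will give,
for any $t$ greater than unity,

$$P\{|y - a| < \sigma t\} \ge 1 - 1/t^2.$$

Write $1/t^2 = \eta$

and the inequality becomes

$$P\{|y - a| < \sigma/\sqrt{\eta}\} \ge 1 - \eta.$$

Now, assuming $\sigma^2 \to 0$ as $n \to \infty$, whatever the numbers $\epsilon$ and $\eta$,
where $\epsilon$ and $\eta$ are as small as desired, it will be possible to find
a number $n_0$ so large that if $n > n_0$ then

$$\sigma/\sqrt{\eta} < \epsilon.$$

It follows therefore that

$$P\{|y - a| < \epsilon\} \ge P\{|y - a| < \sigma/\sqrt{\eta}\} \ge 1 - \eta$$

and the inequality is proved.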

Example. $m$ dice are thrown. If $\bar{x}$ is the mean of the sum of the
dots on their upper faces find



Let $x_i$ ($i = 1, 2, \ldots, m$) be the number of dots on the upper face
of the $i$th die. Hence

$$\bar{x} = \frac{1}{m}\sum_{i=1}^{m}x_i.$$

From first principles it may be shown that

$$\sigma^2_{x_i} = \frac{35}{12} \quad \text{and} \quad \mathcal{E}(x_i) = \frac{7}{2}.$$

Also since the $x$'s must be independent

$$\mathcal{E}(\bar{x}) = \frac{7}{2} \quad \text{and} \quad \sigma^2_{\bar{x}} = \frac{35}{12m}.$$

If $m \to \infty$ then $\sigma_{\bar{x}} \to 0$.

The probability it is required to calculate therefore is 



by an application of the Law of Large Numbers. As $m$ increases
this probability will tend to unity. It may be noted that for the
probability to exist $m$ must be greater than 11.
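
The arithmetic of this example may be sketched as follows. The mean and variance of a single die are found from first principles with exact fractions, and the Bienaymé-Tchebycheff lower limit for $P\{|\bar{x} - 7/2| < \epsilon\}$ is then $1 - 35/(12m\epsilon^2)$. The value $\epsilon = 1/2$ used below is an assumption made for illustration, as are the function names; it happens to give a limit which is positive only when $m$ exceeds 11.

```python
from fractions import Fraction

FACES = range(1, 7)

def die_mean_and_variance():
    """Mean and variance of the number of dots on one unbiased die."""
    mean = Fraction(sum(FACES), 6)
    variance = sum((f - mean) ** 2 for f in FACES) / 6
    return mean, variance

def lower_limit(m, eps):
    """Tchebycheff lower limit for P{|xbar - 7/2| < eps} with m dice."""
    _, variance = die_mean_and_variance()
    return 1 - variance / (m * eps ** 2)

mean, variance = die_mean_and_variance()
eps = Fraction(1, 2)   # assumed value of the tolerance, for illustration only
for m in (11, 12, 50, 500):
    print(m, float(lower_limit(m, eps)))
```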

THEOREM. (Generalized Tchebycheff Inequality.) Assume that
there are $n$ independent random variables $x_1, x_2, \ldots, x_n$ and that

$$\mathcal{E}(x_i) = a_i, \quad \mathcal{E}(x_i - a_i)^2 = \sigma_i^2, \quad \text{for } i = 1, 2, \ldots, n.$$

Then, provided $t^2 < n$, where $t$ is a number at choice



The proof follows directly along the lines of the two preceding 
theorems. 

THEOREM. (Poisson's Law of Large Numbers.) It is assumed
that the probability for the success of an event varies from trial
to trial. In $n$ successive trials the successive probabilities are
$p_1, p_2, \ldots, p_n$. If there be $k$ successes in $n$ trials then



n 9 



where t is a number at choice and t 2 < n. 

Assume that characteristic random variables $x_1, x_2, \ldots, x_n$ are
attached one to each trial. The proof of the theorem then follows
from an application of the Bienaymé-Tchebycheff inequality.

The above theorems are only a few of the many which could be 
quoted and which all express the same conclusion; that, given 
a function of n random variables, the difference between the 
observed value of the function and its expectation will become 
small as n increases provided the standard error of the function 
tends to zero. Mathematically the conclusions cannot be queried, 
but the question may be raised as to whether they are of any 
practical importance. It is held by some that these laws can be
made to justify a given definition of probability but it is doubtful
if this can be so. In no practical problem can the conditions
under which the material is collected be kept constant and $n$ can
never tend to infinity. Possibly the most that can be drawn from 
the theorems is the reminder that the larger the sample under 
consideration, all other things being equal, the smaller will be 
the difference between the sample estimate and its expected 
value. We note, however, that there will always be a difference 
in practice. 

LEXIS THEORY 

We now leave the theorems regarding the Laws of Large
Numbers and turn to the further applications of the simple
theorems on expectations. It will be supposed that there is a
number of independent trials, $N$, which may be divided into
$n$ sets of $s$ so that $N = ns$.

If the binomial theorem on probabilities is applicable to these
observations then it is necessary for the probability to be
constant throughout the set of $N$ trials and therefore throughout
each of the $n$ sets of $s$. Such a set of $N$ trials is sometimes spoken
of as a Bernoulli series. If the probability is not constant throughout
the set of $N$ observations then two ways in which it may vary
will be considered. Suppose first that the probability varies from
trial to trial within a set of $s$ observations but that the variation
is the same within each set. That is to say, if the probability for
the fifth event in the first set is $p_5$, then this will be the probability
for the fifth event in each set. This type of variation is known as
Poisson.* Secondly, suppose that the probability is the same
within a set of $s$ observations but varies from set to set; the
variation is then known as Lexis. The theory, commonly called
Lexis theory, which we shall now develop, deals with the separation
of these three types of variation.

Lexis theory is commonly applied to birth and death rates and
it is not inappropriate therefore to illustrate the difference
between these three types of variation in this way. The probability
of death at a given age among (say) university students
may be assumed to be constant and it is unlikely that the
estimate of probability would vary to any marked extent if a
large number of students were arbitrarily divided into different
sets. We should be justified in this case in assuming that the
binomial (or Bernoulli) series would be valid.

* This should not be confused with Poisson's limit to the binomial.

Next consider a town divided into different homogeneous 
districts. We might assume that the probability of death at a given 
age would be the same for each district but that the probability 
of death would be different for different age-groups. This would 
be an example of Poisson variation. 

Finally we might consider n different age groups in a single 
district. The probability of death at a given age may be assumed 
constant for s persons of the same age, but it will be different for 
different age groups. This would be Lexis variation. 

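Consider therefore $N$ random independent variables, $x$,
divided into $n$ sets of $s$ in the following way:

$$\begin{array}{ll}
x_{11},\ x_{12},\ \ldots,\ x_{1s}; & \dfrac{1}{s}\sum_{j=1}^{s}x_{1j} = \bar{x}_1, \\
x_{21},\ x_{22},\ \ldots,\ x_{2s}; & \dfrac{1}{s}\sum_{j=1}^{s}x_{2j} = \bar{x}_2, \\
\quad\vdots & \\
x_{i1},\ x_{i2},\ \ldots,\ x_{is}; & \dfrac{1}{s}\sum_{j=1}^{s}x_{ij} = \bar{x}_i, \\
\quad\vdots & \\
x_{n1},\ x_{n2},\ \ldots,\ x_{ns}; & \dfrac{1}{s}\sum_{j=1}^{s}x_{nj} = \bar{x}_n.
\end{array}$$

If $x_{ij}$ is the $j$th variable in the $i$th set, let

$$\mathcal{E}(x_{ij}) = a_{ij}, \qquad \mathcal{E}(x_{ij} - a_{ij})^2 = \sigma_{ij}^2,$$

$$\mathcal{E}(\bar{x}_i) = a_i \quad \text{and} \quad \mathcal{E}(\bar{x}_i - a_i)^2 = \sigma^2_{\bar{x}_i}$$

for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, s$.

We have shown previously that

$$\mathcal{E}\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right) = \frac{n-1}{n}\sum_{i=1}^{n}\sigma_i^2 + \sum_{i=1}^{n}(a_i - \bar{a})^2,$$

where $a_i$ is the expectation of $x_i$ and $\sigma_i$ its standard error, and $\bar{a}$ is the mean of the $a_i$.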
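Applying this theorem to the present variables we have

$$\mathcal{E}\left(\sum_{j=1}^{s}(x_{ij} - \bar{x}_i)^2\right) = \frac{s-1}{s}\sum_{j=1}^{s}\sigma_{ij}^2 + \sum_{j=1}^{s}(a_{ij} - a_i)^2$$

and summing for each set

$$\mathcal{E}\left(\sum_{i=1}^{n}\sum_{j=1}^{s}(x_{ij} - \bar{x}_i)^2\right) = \frac{s-1}{s}\sum_{i=1}^{n}\sum_{j=1}^{s}\sigma_{ij}^2 + \sum_{i=1}^{n}\sum_{j=1}^{s}(a_{ij} - a_i)^2.$$

Again applying the theorem we have

$$\mathcal{E}\left(\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2\right) = \frac{n-1}{n}\sum_{i=1}^{n}\sigma^2_{\bar{x}_i} + \sum_{i=1}^{n}(a_i - a)^2,$$

where $\bar{x}$ is the mean of all the observations and $a$ is its expectation.
We now proceed to establish a relation between $\sigma_{ij}^2$ and $\sigma^2_{\bar{x}_i}$.

$$\sigma^2_{\bar{x}_i} = \frac{1}{s^2}\sum_{j=1}^{s}\sigma_{ij}^2.$$

The two fundamental equations can be combined by means of
this relationship to give a single equation; eliminating $\sigma_{ij}^2$ and $\sigma^2_{\bar{x}_i}$,

$$\frac{1}{s(s-1)}\left[\mathcal{E}\left(\sum_{i=1}^{n}\sum_{j=1}^{s}(x_{ij} - \bar{x}_i)^2\right) - \sum_{i=1}^{n}\sum_{j=1}^{s}(a_{ij} - a_i)^2\right] = \frac{n}{n-1}\left[\mathcal{E}\left(\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2\right) - \sum_{i=1}^{n}(a_i - a)^2\right].$$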

It remains to make an approximation, and replace expecta- 
tions by observed values. This can only be an approximation 
but provided n and s are both large it will probably be adequate. 
In any case, since we are aiming at applying the theorem to 
observations, it will be the best that can be done. 

We now distinguish between three types of variation. 

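I. Bernoulli

For the Bernoulli law of variation to hold

$$a_{ij} = a_i = a \quad \text{for } i = 1, 2, \ldots, n, \text{ and } j = 1, 2, \ldots, s,$$

that is the expectations of the random variables in each set are
equal to one another and are also equal to the expectations of
the random variables in any other set. It follows that

$$\frac{1}{s(s-1)}\,\mathcal{E}\left(\sum_{i=1}^{n}\sum_{j=1}^{s}(x_{ij} - \bar{x}_i)^2\right) = \frac{n}{n-1}\,\mathcal{E}\left(\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2\right).$$

Hence if calculations are made on $n$ sets of $s$ and it is shown that

$$\frac{1}{s(s-1)}\sum_{i=1}^{n}\sum_{j=1}^{s}(x_{ij} - \bar{x}_i)^2 = \frac{n}{n-1}\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2$$

then we should be justified in assuming the Bernoulli (or binomial)
law of variation. When this equation holds it is said that
there is normal dispersion.

II. Poisson

For the Poisson law of variation to hold

$$a_{ij} \neq a_i \quad \text{but} \quad a_i = a,$$

whence

$$\frac{1}{s(s-1)}\left[\mathcal{E}\left(\sum_{i=1}^{n}\sum_{j=1}^{s}(x_{ij} - \bar{x}_i)^2\right) - \sum_{i=1}^{n}\sum_{j=1}^{s}(a_{ij} - a_i)^2\right] = \frac{n}{n-1}\,\mathcal{E}\left(\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2\right)$$

and therefore

$$\frac{1}{s(s-1)}\,\mathcal{E}\left(\sum_{i=1}^{n}\sum_{j=1}^{s}(x_{ij} - \bar{x}_i)^2\right) > \frac{n}{n-1}\,\mathcal{E}\left(\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2\right).$$

If from the observations it is found that

$$\frac{1}{s(s-1)}\sum_{i=1}^{n}\sum_{j=1}^{s}(x_{ij} - \bar{x}_i)^2 > \frac{n}{n-1}\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2$$

then it would be justifiable to assume Poisson variation and we
should say that there was sub-normal dispersion.

III. Lexis

For the Lexis law of variation to hold

$$a_{ij} = a_i \quad \text{but} \quad a_i \neq a.$$

A similar reasoning to the above will give that

$$\frac{1}{s(s-1)}\,\mathcal{E}\left(\sum_{i=1}^{n}\sum_{j=1}^{s}(x_{ij} - \bar{x}_i)^2\right) < \frac{n}{n-1}\,\mathcal{E}\left(\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2\right)$$

and if this was found to be true when expectations are replaced
by observed values the dispersion would be said to be super-normal.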
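
The two observed quantities which are compared in these three cases may be computed as in the sketch below. The function name and the small sets of data are hypothetical; the data are deliberately extreme, so that the direction of the inequality is plain.

```python
def dispersion_sides(sets):
    """For n sets of s observations return (within, between), where
    within  = sum_ij (x_ij - xbar_i)^2 / (s (s - 1)) and
    between = n / (n - 1) * sum_i (xbar_i - xbar)^2."""
    n = len(sets)
    s = len(sets[0])
    means = [sum(row) / s for row in sets]
    grand = sum(means) / n
    within = sum((x - m) ** 2
                 for row, m in zip(sets, means) for x in row) / (s * (s - 1))
    between = n / (n - 1) * sum((m - grand) ** 2 for m in means)
    return within, between

# Every set shows the same internal variation: within exceeds between.
poisson_like = [[0, 1, 0, 1], [0, 1, 0, 1], [0, 1, 0, 1]]
# No variation inside a set, but the sets differ: between exceeds within.
lexis_like = [[0, 0, 0, 0], [1, 1, 1, 1]]

print(dispersion_sides(poisson_like))
print(dispersion_sides(lexis_like))
```

With real observations the two quantities will never be exactly equal, and some judgment is needed as to whether the difference found is larger than sampling fluctuation alone would produce.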

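It is customary to calculate what is known as the Lexis Ratio
on probabilities. This ratio may be found simply by assuming that
each of the $x$'s of the original set-up is a characteristic random
variable, capable therefore of taking the values 0 or 1 only. Let
the probability that $x_{ij}$ takes the value 1 be $p_{ij}$. We begin with
the equation

$$\mathcal{E}\left(\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2\right) = \frac{n-1}{ns^2}\sum_{i=1}^{n}\sum_{j=1}^{s}\sigma_{ij}^2 + \sum_{i=1}^{n}(a_i - a)^2,$$

which follows directly from the immediately preceding analysis.

Let $\mathcal{E}(x_{ij}) = a_{ij} = p_{ij}$, with $a_i = p_i$ and $a = p$,

from which it follows that

$$\mathcal{E}(x_{ij} - a_{ij})^2 = \sigma_{ij}^2 = p_{ij} - p_{ij}^2.$$

Substitution in the equation gives

$$\mathcal{E}\left(\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2\right) = \frac{n-1}{ns^2}\sum_{i=1}^{n}\sum_{j=1}^{s}(p_{ij} - p_{ij}^2) + \sum_{i=1}^{n}(p_i - p)^2.$$

Now

$$\sum_{j=1}^{s}(p_{ij} - p_i)^2 = \sum_{j=1}^{s}p_{ij}^2 - sp_i^2$$

and therefore

$$\sum_{j=1}^{s}p_{ij}^2 = \sum_{j=1}^{s}(p_{ij} - p_i)^2 + sp_i^2.$$

Similarly

$$\sum_{i=1}^{n}p_i^2 = \sum_{i=1}^{n}(p_i - p)^2 + np^2.$$

Again substituting, in turn, in the equation we have

$$\mathcal{E}\left(\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2\right) = \frac{(n-1)p(1-p)}{s} - \frac{n-1}{ns^2}\sum_{i=1}^{n}\sum_{j=1}^{s}(p_{ij} - p_i)^2 + \frac{ns - n + 1}{ns}\sum_{i=1}^{n}(p_i - p)^2.$$

Finally, if we write

$$\sigma^2 = \frac{1}{n-1}\,\mathcal{E}\left(\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2\right) \quad \text{and} \quad q = 1 - p,$$

the equation becomes

$$\sigma^2 = \frac{pq}{s} - \frac{1}{ns^2}\sum_{i=1}^{n}\sum_{j=1}^{s}(p_{ij} - p_i)^2 + \frac{ns - n + 1}{ns(n-1)}\sum_{i=1}^{n}(p_i - p)^2.$$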
If the probability is constant throughout the trials, as in the case
of binomial probabilities, then we have for the (standard error)$^2$
of the probability

$$\sigma^2 = \frac{pq}{s},$$

which is familiar. For Poisson $\sigma^2$ will be less than $pq/s$ while for
Lexis it will be greater.

If $\sigma'$ is the actual standard error estimated from the observations
and if $\sigma_B = \sqrt{pq/s}$ is the standard error of the probability
assuming it is constant from trial to trial then the Lexis Ratio
$L$ is defined as $L = \sigma'/\sigma_B$.

It will be seen that if $L$ is unity then the probability may be
assumed to be constant throughout the observations. If $L < 1$
it may be assumed that the probability is varying within the set
but varies in the same way from set to set. If $L > 1$ the probability
may be assumed constant within the set but variable from set to
set. We have at present no means of judging the significance of
the departure of $L$ from unity, although the student will realize
at a later stage that the $\chi^2$ distribution may be used.

Example. Rietz gives the following example of the death
rates of white infants under one year of age in the U.S.A.

State          Births    Deaths per 1000
California     50,707    70
Indiana        57,915    78
Kentucky       53,658    77
Minnesota      51,452    66
N. Carolina    51,832    74
Wisconsin      54,472    79
Mean           53,339    74



The numbers of births in each state are approximately equal,
and the application of Lexis theory would seem to be appropriate.
The average number may be taken equal to $s$, the supposed
number in each set, and the standard error assuming the
probability constant will be

$$\sigma_B = \sqrt{\frac{0.074 \times 0.926}{53{,}339}} = 1.13 \text{ (per thousand)}.$$

The standard error estimated from the observations is 5.14, and
the Lexis Ratio is therefore

$$L = \frac{5.14}{1.13} = 4.5.$$

This is considerably different from unity and we may draw the
inference that it is likely that the infant mortality rate is
significantly different from State to State.
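
The arithmetic of the example may be set out as a short sketch. The function name is hypothetical; the mean rate of 74 per thousand, the mean number of births and the observed standard error of 5.14 per thousand are taken from the example as given and are not recomputed here.

```python
import math

def lexis_ratio(p, s, sigma_observed_per_thousand):
    """Return (sigma_B per thousand, L), where sigma_B = sqrt(pq/s)."""
    sigma_b = 1000 * math.sqrt(p * (1 - p) / s)
    return sigma_b, sigma_observed_per_thousand / sigma_b

# Values quoted in the example above.
sigma_b, L = lexis_ratio(p=0.074, s=53339, sigma_observed_per_thousand=5.14)
print(round(sigma_b, 2), round(L, 1))
```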

Exercise. The death rate in Germany per 1000 inhabitants is
given for the years 1877-86 in the table below. Assume that
45,000,000 was the size of the population of Germany within the
period 1877-86 and study the dispersion within the table.
(B.Sc. London 1938.)

Year         1877   1878   1879   1880   1881   1882   1883   1884   1885   1886
Death-rate   28.0   27.8   27.2   27.5   26.9   27.2   27.3   27.4   27.2   27.6



Exercise. The proportions of males born in Vienna during the
years 1908 and 1909 are given below:

Proportion of male births

Month   Jan.    Feb.    Mar.    Apr.    May     June
1908    0.522   0.513   0.514   0.525   0.513   0.514
1909    0.514   0.509   0.599   0.510   0.514   0.509

Month   July    Aug.    Sept.   Oct.    Nov.    Dec.
1908    0.519   0.521   0.511   0.520   0.512   0.514
1909    0.513   0.528   0.518   0.513   0.518   0.503

Assume that the number of births in Vienna was 3903 for each
of these 24 months and study the dispersion within the table.
(B.Sc. London 1938.)

There is one aspect of the analysis of this type of variation
which might be mentioned. It will be convenient to refer back
to the original scheme for $N$ random independent variables, the
fundamental equation for which was shown to be

$$\frac{1}{s(s-1)}\left[\mathcal{E}\left(\sum_{i=1}^{n}\sum_{j=1}^{s}(x_{ij} - \bar{x}_i)^2\right) - \sum_{i=1}^{n}\sum_{j=1}^{s}(a_{ij} - a_i)^2\right] = \frac{n}{n-1}\left[\mathcal{E}\left(\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2\right) - \sum_{i=1}^{n}(a_i - a)^2\right].$$

If the $x$'s are regarded as units which have been randomly and
independently drawn from the same population then

$$a_{ij} = a_i = a$$

and

$$\frac{1}{s(s-1)}\,\mathcal{E}\left(\sum_{i=1}^{n}\sum_{j=1}^{s}(x_{ij} - \bar{x}_i)^2\right) = \frac{n}{n-1}\,\mathcal{E}\left(\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2\right)$$

or

$$\frac{1}{n(s-1)}\,\mathcal{E}\left(\sum_{i=1}^{n}\sum_{j=1}^{s}(x_{ij} - \bar{x}_i)^2\right) = \frac{s}{n-1}\,\mathcal{E}\left(\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2\right).$$

Since the $N$ units are assumed homogeneous it will be clear that
each side of this equation represents an estimate of the total
variance in the population. If we write $V$ for this total variance,
then it will be recognized that without dividing the material
into sets

$$V = \frac{1}{ns - 1}\,\mathcal{E}\left(\sum_{i=1}^{n}\sum_{j=1}^{s}(x_{ij} - \bar{x})^2\right)$$

and that this $V$ will also be equal to either side of the last equation.
It follows therefore that if the material is homogeneous the following
estimates of variance will all have the same expectation
(replacing expectations by observed values in the equation),

$$V = \frac{1}{ns-1}\sum_{i=1}^{n}\sum_{j=1}^{s}(x_{ij} - \bar{x})^2, \qquad V_s = \frac{s}{n-1}\sum_{i=1}^{n}(\bar{x}_i - \bar{x})^2,$$

$$V_i = \frac{1}{n(s-1)}\sum_{i=1}^{n}\sum_{j=1}^{s}(x_{ij} - \bar{x}_i)^2.$$

Now $V_s$ measures the variation between the arithmetic means of
the sets, and $V_i$ the total variation within the sets. It follows that
if the quantity $V_s/V_i$ is calculated on actual material we should
obtain some idea of the variation between sets as opposed to the
variation within sets. An exact test of significance is available
for this, if the further assumption is made that the random
variables are normally distributed.
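
The three estimates are connected by an algebraic identity, since the total sum of squares about the general mean is the sum of the part within sets and $s$ times the part between set means. The sketch below computes the estimates with exact fractions and prints their ratio; the notation $V$, $V_s$, $V_i$ for the total, between-set and within-set estimates, the function name and the small table of data are all assumptions made for illustration.

```python
from fractions import Fraction

def variance_estimates(sets):
    """Return (V, V_s, V_i) for n sets of s observations each."""
    n = len(sets)
    s = len(sets[0])
    rows = [[Fraction(x) for x in row] for row in sets]
    means = [sum(row) / s for row in rows]
    grand = sum(means) / n
    ss_total = sum((x - grand) ** 2 for row in rows for x in row)
    ss_within = sum((x - m) ** 2 for row, m in zip(rows, means) for x in row)
    ss_between = sum((m - grand) ** 2 for m in means)
    V = ss_total / (n * s - 1)
    V_s = s * ss_between / (n - 1)
    V_i = ss_within / (n * (s - 1))
    return V, V_s, V_i

data = [[1, 2, 3], [2, 4, 6], [3, 3, 9]]   # hypothetical observations
V, V_s, V_i = variance_estimates(data)
print(V, V_s, V_i, V_s / V_i)
```

For these figures $(ns-1)V = n(s-1)V_i + (n-1)V_s$ exactly, which is the identity on which the analysis of variance rests.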
REFERENCES AND READING

For further reading and exercises on the inequalities and the laws of 

CHAPTER XIII

SIMPLE ESTIMATION. MARKOFF THEOREM 
ON LEAST SQUARES 

Much of modern statistical technique is directed towards 
drawing valid inferences from a sample about the population 
from which that sample was drawn. In the early days of the 
subject the samples drawn were so large that the collective 
characters, such as the mean and standard deviation, of the 
sample could justly be inferred to be adequate estimates of the 
collective characters in the population. With the exploitation 
of small samples it was recognized that the sample collective 
characters need not necessarily be the best estimates of the 
population collective characters, and it became necessary to lay 
down certain rules which it is considered a collective character 
calculated from the sample must obey in order to be considered 
a true estimate of a collective character in the population. 

Possibly no branch of statistical technique has been the subject 
of more controversy than the theory of estimation. We do not 
propose to enter into the details of this controversy and shall 
restrict ourselves to a statement of first principles.* These we
shall formalize in the following way. It will be assumed that
there is a collective character $\theta$, of a population $\pi$, which it
is desired to estimate; $n$ drawings are made randomly and
independently from $\pi$ resulting in a number of observations

$$x_1, x_2, \ldots, x_n.$$

DEFINITION. A function $F(x_1, x_2, \ldots, x_n)$ is an unbiased estimate
of $\theta$ if, whatever the properties of the population $\pi$,

$$\mathcal{E}(F) = \theta.$$



As an example of an unbiased estimate it will be remembered 
that the expectation of the sample mean, that is the mean value 
of the means in repeated sampling, is equal to the population 
mean. 

* I do not wish to be dogmatic in any way and am prepared to admit 
that anyone may lay down any principles he chooses. 

An example of a biased estimate is found in

$$s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$

The mean value of the sample variance in repeated sampling is
not equal to the population variance. Thus the sample variance
used as an estimate of the population variance will tend to
underestimate it, the bias in this case being $\sigma^2/n$.
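
The bias may be exhibited exactly by enumerating every possible sample of a given size from a small population of equally likely values. The sketch below uses the six faces of a die as a hypothetical population, for which $\sigma^2 = 35/12$, and averages both the divisor-$n$ and the divisor-$(n-1)$ estimates over all samples of three; the function name is an assumption.

```python
from fractions import Fraction
from itertools import product

def mean_of_variance_estimates(population, n):
    """Average, over all equally likely ordered samples of size n,
    of the estimates with divisor n and with divisor n - 1."""
    values = [Fraction(v) for v in population]
    total_n = total_n1 = Fraction(0)
    count = 0
    for sample in product(values, repeat=n):
        mean = sum(sample) / n
        ss = sum((x - mean) ** 2 for x in sample)
        total_n += ss / n
        total_n1 += ss / (n - 1)
        count += 1
    return total_n / count, total_n1 / count

population = range(1, 7)          # hypothetical population: faces of a die
sigma2 = Fraction(35, 12)         # its variance
n = 3
biased, unbiased = mean_of_variance_estimates(population, n)
print(biased, unbiased, sigma2 - biased)
```

The difference between the population variance and the mean of the divisor-$n$ estimate comes out as exactly $\sigma^2/n$.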

In general there will be a large number of functions $F$ which
will satisfy the relationship

$$\mathcal{E}(F) = \theta$$

and a further rule is necessary in order to choose between them.

DEFINITION. A function $F(x_1, x_2, \ldots, x_n)$ is the 'best' unbiased
estimate of $\theta$ if, whatever the properties of the population, $\pi$, it
satisfies the relation

$$\mathcal{E}(F) = \theta$$

and $\mathcal{E}(F - \theta)^2$ is a minimum.

Example. The best linear unbiased estimate of the mean of the
population, $\pi$, given a sample of $n$ individuals randomly and
independently drawn from $\pi$, is the sample mean. Let the
population mean be $\theta$, and to the $n$ sample values attach random
variables

$$x_1, x_2, \ldots, x_n.$$

We shall consider a linear function of these $x$'s, say

$$F = a_1x_1 + a_2x_2 + \ldots + a_nx_n = \sum_{i=1}^{n}a_ix_i,$$

and find the conditions that the $a$'s must satisfy in order that $F$
shall be a best linear unbiased estimate of $\theta$.

Condition I. $\mathcal{E}(F) = \theta$.

$$\mathcal{E}(F) = \mathcal{E}\left(\sum_{i=1}^{n}a_ix_i\right) = \sum_{i=1}^{n}a_i\,\mathcal{E}(x_i) = \theta\sum_{i=1}^{n}a_i = \theta$$

and it follows that $\sum_{i=1}^{n}a_i = 1$.

Condition II. $\mathcal{E}(F - \theta)^2$ is a minimum subject to $\sum_{i=1}^{n}a_i = 1$. Now

$$\mathcal{E}(F - \theta)^2 = \sigma^2\sum_{i=1}^{n}a_i^2,$$

where $\sigma$ is the population standard deviation.
It is necessary for $\sigma^2\sum_{i=1}^{n}a_i^2$ to be a minimum, subject to the
restriction that $\sum_{i=1}^{n}a_i = 1$, and we now find the $a$'s to satisfy
these two conditions.

Let $\alpha$ be Lagrange's undetermined multiplier and construct
the function

$$\phi = \sigma^2\sum_{i=1}^{n}a_i^2 - 2\alpha\sum_{i=1}^{n}a_i.$$

$$\frac{\partial\phi}{\partial a_i} = 2\sigma^2a_i - 2\alpha = 0 \quad \text{and} \quad \alpha = \sigma^2a_i \quad \text{for } i = 1, 2, \ldots, n.$$

Summing for all values of $i$ it is seen that $n\alpha = \sigma^2$ and therefore
that

$$a_i = 1/n \quad \text{for } i = 1, 2, \ldots, n.$$

This is true for any value of $i$ and the best linear unbiased
estimate of the population mean will be

$$F = \frac{1}{n}\sum_{i=1}^{n}x_i = \bar{x}.$$

Similarly, although the process involves somewhat lengthy
algebra, it may be shown that $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ is the best
quadratic unbiased estimate of $\sigma^2$.
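
The result of the minimization may be checked numerically. For independent observations with a common standard deviation $\sigma$ the (standard error)$^2$ of $\sum a_ix_i$ is $\sigma^2\sum a_i^2$, so that it is sufficient to compare $\sum a_i^2$ for equal weights with its value for other sets of weights adding to unity. In the sketch below the alternative weights are generated at random; the sample size, the number of trials and the function names are hypothetical.

```python
import random

def variance_factor(weights):
    """Sum of squared weights: the (standard error)^2 of the linear
    estimate is sigma^2 times this factor."""
    return sum(a * a for a in weights)

def random_unbiased_weights(n, rng):
    """Random positive weights rescaled so that they add to unity."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

n = 8
equal = [1 / n] * n
rng = random.Random(0)   # fixed seed so that the run can be repeated
trials = [variance_factor(random_unbiased_weights(n, rng))
          for _ in range(1000)]
print(variance_factor(equal), min(trials))
```

No set of weights adding to unity gives a factor below $1/n$, in agreement with the solution $a_i = 1/n$.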

Example. It is given that $x_1, x_2, x_3$ are three random variables
and that



Further, $x_1$ is independent of $x_2$ and $x_3$, but $x_2$ and $x_3$ are correlated
and have a correlation coefficient equal to 0.25. Deduce a
formula for the best linear unbiased estimate of $\theta$ and calculate
its standard error.

Let $F$ be a linear function

$$F = a_1x_1 + a_2x_2 + a_3x_3 + a_4.$$

We shall deduce the values of the a's necessary to fulfil the 
given conditions. First 



whence on substituting the given values for the expectations we
find two relations,

a l + a 2 + a 3 = 1, 2a 2 + a 3 + a 4 = 0. 

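Next

$$\mathcal{E}(F - \mathcal{E}(F))^2 = a_1^2\,\mathcal{E}(x_1 - \mathcal{E}(x_1))^2 + a_2^2\,\mathcal{E}(x_2 - \mathcal{E}(x_2))^2 + a_3^2\,\mathcal{E}(x_3 - \mathcal{E}(x_3))^2 + 2a_2a_3\,\mathcal{E}[(x_2 - \mathcal{E}(x_2))(x_3 - \mathcal{E}(x_3))].$$

The other cross-products vanish because $x_1$ is given independent
of $x_2$ and $x_3$. Hence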



Writing $a_1 = 1 - (a_2 + a_3)$, and differentiating partially with
respect to $a_2$ and $a_3$ we obtain, after substitution of the given
values,

$$a_2 = 0.033, \quad a_3 = 0.965 \quad \text{and} \quad a_1 = 0.002,$$

and from the second relation of the $a$'s therefore that

$$a_4 = -0.899.$$

$\sigma_F^2$ is found to be 0.0103 and the best linear unbiased estimate of
$\theta$ is

$$F = 0.002x_1 + 0.033x_2 + 0.965x_3 - 0.899.$$

It is interesting to note how the standard error of a given $x$
affects the size of the coefficient of $x$ as determined by this
method. $x_3$ has a very much smaller standard error than either
$x_1$ or $x_2$, and consequently that variable plays by far the largest
part in determining $F$.

This estimate $F$ is one which would not be easy to determine
a priori on intuitive grounds. If no attention is paid to the
standard errors of the $x$'s it might have seemed natural to take

$$a_1 = a_2 = a_3 = a_4 = \tfrac{1}{3}.$$

If the $a$'s had been so chosen, then

$$\mathcal{E}(F) = \xi$$

and $F$ would be an unbiased estimate of $\xi$. Suppose we now
examine its (standard error)$^2$:

$$\sigma_F^2 = 0.556.$$

The standard error of this unbiased estimate is therefore some
seven times greater than that of the estimate which we defined as the
'best'.
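The comparison between the 'best' weights and equal weights can be sketched in a few lines. The given expectations and standard errors of the example are not reproduced here, so the standard errors below (2.0, 0.3, 0.1) are hypothetical values chosen only for illustration, and the function names are our own. The additive constant $a_4$ affects the bias but not the variance, and is therefore omitted.

```python
def blue_weights(s1, s2, s3, rho23):
    """Weights (a1, a2, a3), summing to 1, that minimise the variance of
    a1*x1 + a2*x2 + a3*x3 when x1 is independent of x2 and x3 and the
    correlation between x2 and x3 is rho23."""
    v1, v2, v3 = s1 ** 2, s2 ** 2, s3 ** 2
    c = rho23 * s2 * s3                      # covariance of x2 and x3
    # Put a1 = 1 - (a2 + a3) and set both partial derivatives to zero:
    #   (v1 + v2) a2 + (v1 + c)  a3 = v1
    #   (v1 + c)  a2 + (v1 + v3) a3 = v1
    det = (v1 + v2) * (v1 + v3) - (v1 + c) ** 2
    a2 = v1 * (v3 - c) / det
    a3 = v1 * (v2 - c) / det
    return 1.0 - a2 - a3, a2, a3

def variance(a, s1, s2, s3, rho23):
    """(Standard error)^2 of a1*x1 + a2*x2 + a3*x3."""
    a1, a2, a3 = a
    return ((a1 * s1) ** 2 + (a2 * s2) ** 2 + (a3 * s3) ** 2
            + 2 * a2 * a3 * rho23 * s2 * s3)

s = (2.0, 0.3, 0.1)          # hypothetical standard errors of x1, x2, x3
best = blue_weights(*s, 0.25)
var_best = variance(best, *s, 0.25)
var_equal = variance((1 / 3, 1 / 3, 1 / 3), *s, 0.25)

print(best)                  # the variable with the smallest s.e. dominates
print(var_best, var_equal)   # equal weights give a much larger variance
```

With these assumed figures the third variable again receives nearly all the weight, which is the behaviour remarked on in the text.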
The general theory of estimation covers the estimation of any
collective character in a population, $\pi$, from the evidence of the
sample. This is, however, rather a wide field and we shall
therefore consider further only the determination of best linear unbiased
estimates from observations which are randomly and
independently drawn from a given population or series of populations.
That is to say we shall consider only the case where

$$F = \lambda_1 x_1 + \lambda_2 x_2 + \cdots + \lambda_n x_n$$

is a linear function of the $x$'s, and where the $x$'s themselves may
be considered as random independent variables. Functions of
this type are easily determined by an application of a
generalization of the theorem on least squares usually attributed to
Markoff. The theorem will be stated for $s$ parameters, but
because the proof is rather long and has already been set out fully
elsewhere, we shall prove it for the case of two parameters only.
This last was the case considered by Markoff.

GENERALIZED MARKOFF THEOREM ON 
LEAST SQUARES 

Consider $n$ populations $\pi_1, \pi_2, \ldots, \pi_i, \ldots, \pi_n$. Out of each
population an individual is randomly drawn and some given character
measured. Suppose that on the individual drawn from the $i$th
population, $\pi_i$, the measured character is $x_i$ ($i = 1, 2, \ldots, n$). It
is required to estimate $\theta$, where $\theta$ is some collective character of
the $n$ populations $\pi_i$.

If (i) $x_1, x_2, \ldots, x_n$ are independent,

(ii) the expectation of each $x_i$ ($i = 1, 2, \ldots, n$) is known to be
a linear function of $s \le n$ unknown parameters, $p_j$ ($j = 1, 2, \ldots, s$),
with known coefficients, $a_{ij}$, i.e.

$$\mathcal{E}(x_i) = a_{i1}p_1 + a_{i2}p_2 + \cdots + a_{is}p_s,$$

(iii) out of the $n$ equations for $\mathcal{E}(x_i)$ ($i = 1, 2, \ldots, n$), it is possible
to select at least one system of $s$ equations which is soluble with
respect to the $p$'s, that is if at least one of the determinants of the
$s$th order of the matrix $M$ is different from zero, where

$$M = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1s} \\ a_{21} & a_{22} & a_{23} & \cdots & a_{2s} \\ \vdots & \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & a_{n3} & \cdots & a_{ns} \end{pmatrix},$$

(iv) the standard error $\sigma_i$ of $x_i$ is known to satisfy the relation

$$\sigma_i^2 = \frac{\sigma^2}{P_i},$$

where $\sigma$ may be unknown but the $P$'s must be known positive
constants,

then it may be shown that

(1) the best unbiased estimate of any linear function of the $p$'s,

$$\theta = b_1 p_1 + b_2 p_2 + \cdots + b_s p_s,$$

where the $b$'s are known, is obtained by substituting in the
expression for $\theta$, instead of $p_j$ ($j = 1, 2, \ldots, s$), the values
$q_1^0, q_2^0, \ldots, q_s^0$ obtained by minimizing the sum of squares

$$S = \sum_{i=1}^{n} (x_i - a_{i1}q_1 - a_{i2}q_2 - \cdots - a_{is}q_s)^2 P_i$$

with regard to the $q$'s considered as independent variables.
That is to say the best linear unbiased estimate of $\theta$ will be

$$F = b_1 q_1^0 + b_2 q_2^0 + \cdots + b_s q_s^0;$$

(2) the estimate of the (standard error)$^2$ of $F$ is given by the
expression

$$\mu_F^2 = \frac{S_0}{n-s}\sum_{i=1}^{n}\frac{\lambda_i^2}{P_i},$$

where $S_0$ is the minimum value of $S$, i.e.

$$S_0 = \sum_{i=1}^{n} (x_i - a_{i1}q_1^0 - a_{i2}q_2^0 - \cdots - a_{is}q_s^0)^2 P_i,$$

and $\lambda_i$ is the coefficient of $x_i$ in

$$F = \sum_{i=1}^{n} \lambda_i x_i.$$



Stated fully in this way the Markoff theorem might appear 
cumbersome to use. Actually it is not and it will be found to be 
of great practical utility. The proof of the theorem for the case 
of two parameters is relatively easy. 
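To show how little arithmetic the theorem demands when there are two parameters, the following is a minimal sketch under the stated conditions (i) to (iv). The function name and the small data set are illustrative assumptions only. It solves the two normal equations for $q_1^0$ and $q_2^0$, forms the weights $\lambda_i$ from the Lagrange multipliers, and returns $F$, the $\lambda_i$, $S_0$ and $\mu_F^2$.

```python
def markoff_two_parameters(x, a1, a2, P, b1, b2):
    """Best linear unbiased estimate of theta = b1*p1 + b2*p2 when
    E(x_i) = a1[i]*p1 + a2[i]*p2 and var(x_i) = sigma^2 / P[i]."""
    n = len(x)
    A = sum(p * u * u for p, u in zip(P, a1))
    B = sum(p * v * v for p, v in zip(P, a2))
    C = sum(p * u * v for p, u, v in zip(P, a1, a2))
    U = sum(p * u * xi for p, u, xi in zip(P, a1, x))
    V = sum(p * v * xi for p, v, xi in zip(P, a2, x))
    det = A * B - C * C
    q1 = (U * B - V * C) / det            # minimising values of the q's
    q2 = (A * V - C * U) / det
    alpha1 = (b1 * B - b2 * C) / det      # Lagrange multipliers
    alpha2 = (A * b2 - C * b1) / det
    lam = [p * (alpha1 * u + alpha2 * v) for p, u, v in zip(P, a1, a2)]
    F = b1 * q1 + b2 * q2
    S0 = sum(p * (xi - u * q1 - v * q2) ** 2
             for p, xi, u, v in zip(P, x, a1, a2))
    mu2 = S0 / (n - 2) * sum(l * l / p for l, p in zip(lam, P))
    return F, lam, S0, mu2

# Illustrative data: E(x_i) = p1 + p2 * t_i with unequal weights P_i.
t = [0.0, 1.0, 2.0, 3.0]
x = [2.1, 4.8, 8.3, 10.9]
F, lam, S0, mu2 = markoff_two_parameters(x, [1.0] * 4, t, [1, 2, 1, 2], 1.0, 1.5)
print(F, mu2)
print(sum(l * xi for l, xi in zip(lam, x)))   # the same value as F
```

The last line illustrates the first part of the theorem: the linear function $\sum \lambda_i x_i$ built from the restrictions coincides with $b_1 q_1^0 + b_2 q_2^0$.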
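PROOF OF MARKOFF THEOREM FOR TWO
PARAMETERS

$\theta$ is required to be a linear function of two parameters, say
generally

$$\theta = b_0 + b_1 p_1 + b_2 p_2,$$

but since the $b$'s are assumed known we may write

$$\theta = b_1 p_1 + b_2 p_2$$

by suitable modification of the $p$'s. It is required to show first of
all that

$$F = b_1 q_1^0 + b_2 q_2^0$$

is the best linear unbiased estimate of $\theta$, where the $q$'s are as
defined above. Let

$$F = \sum_{i=1}^{n} \lambda_i x_i.$$

Then

$$\mathcal{E}(F) = \sum_{i=1}^{n} \lambda_i\,\mathcal{E}(x_i) = \sum_{i=1}^{n} \lambda_i (a_{i1}p_1 + a_{i2}p_2) \equiv b_1 p_1 + b_2 p_2.$$

A restriction on the $\lambda$'s will be therefore that

$$b_1 = \sum_{i=1}^{n} \lambda_i a_{i1}, \qquad b_2 = \sum_{i=1}^{n} \lambda_i a_{i2}.$$

The fundamental sum of squares, $S$, is

$$S = \sum_{i=1}^{n} (x_i - a_{i1}q_1 - a_{i2}q_2)^2 P_i.$$

It is necessary to minimize this with respect to the $q$'s in order to
obtain $q_1^0$ and $q_2^0$.

Differentiating $S$ partially with respect to $q_1$ and $q_2$ we obtain
the equations

$$\sum a_{i1}x_iP_i = q_1 \sum a_{i1}^2 P_i + q_2 \sum a_{i1}a_{i2}P_i,$$

$$\sum a_{i2}x_iP_i = q_1 \sum a_{i1}a_{i2}P_i + q_2 \sum a_{i2}^2 P_i,$$

whence

$$q_1^0 = \frac{\sum a_{i1}a_{i2}P_i \sum a_{i2}x_iP_i - \sum a_{i2}^2P_i \sum a_{i1}x_iP_i}{\big(\sum a_{i1}a_{i2}P_i\big)^2 - \sum a_{i1}^2P_i \sum a_{i2}^2P_i},$$

$$q_2^0 = \frac{\sum a_{i1}a_{i2}P_i \sum a_{i1}x_iP_i - \sum a_{i1}^2P_i \sum a_{i2}x_iP_i}{\big(\sum a_{i1}a_{i2}P_i\big)^2 - \sum a_{i1}^2P_i \sum a_{i2}^2P_i}.$$

We have now to show from a consideration of the $\lambda$'s that

$$F = \sum_{i=1}^{n} \lambda_i x_i = b_1 q_1^0 + b_2 q_2^0,$$

where $q_1^0$ and $q_2^0$ have the same values as above.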

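We may do this in the following way. Since the $x$'s are
independent

$$\sigma_F^2 = \sum_{i=1}^{n} \lambda_i^2\sigma_i^2 = \sigma^2 \sum_{i=1}^{n} \frac{\lambda_i^2}{P_i}.$$

In order that $F$ shall be a best linear unbiased estimate of $\theta$, $\sigma_F^2$
must be a minimum, subject to the restrictions found above,
which were

$$b_1 = \sum_{i=1}^{n} \lambda_i a_{i1}, \qquad b_2 = \sum_{i=1}^{n} \lambda_i a_{i2}.$$

$\sigma^2$ is constant and it is sufficient therefore to construct a function

$$\phi = \sum_{i=1}^{n} \frac{\lambda_i^2}{P_i} - 2\alpha_1 \sum_{i=1}^{n} \lambda_i a_{i1} - 2\alpha_2 \sum_{i=1}^{n} \lambda_i a_{i2},$$

where $\alpha_1$ and $\alpha_2$ are Lagrange's undetermined multipliers.
Differentiating partially with respect to $\lambda_i$, equating to zero and
solving for $\lambda_i$, we have, after eliminating $\lambda_i$ from the restrictions,
the equations

$$b_1 = \alpha_1 \sum a_{i1}^2 P_i + \alpha_2 \sum a_{i1}a_{i2}P_i,$$

$$b_2 = \alpha_1 \sum a_{i1}a_{i2}P_i + \alpha_2 \sum a_{i2}^2 P_i.$$

Solving for $\alpha_1$ and $\alpha_2$, substituting back in the equation for $\lambda_i$,
and finally putting the value for $\lambda_i$ in the expression

$$F = \sum_{i=1}^{n} \lambda_i x_i$$

we find that

$$F = b_1\,\frac{\sum a_{i1}a_{i2}P_i \sum a_{i2}x_iP_i - \sum a_{i2}^2P_i \sum a_{i1}x_iP_i}{\big(\sum a_{i1}a_{i2}P_i\big)^2 - \sum a_{i1}^2P_i \sum a_{i2}^2P_i} + b_2\,\frac{\sum a_{i1}a_{i2}P_i \sum a_{i1}x_iP_i - \sum a_{i1}^2P_i \sum a_{i2}x_iP_i}{\big(\sum a_{i1}a_{i2}P_i\big)^2 - \sum a_{i1}^2P_i \sum a_{i2}^2P_i}.$$

The multipliers of the $b$'s will be recognized as the quantities
which we have shown to be equal to $q_1^0$ and $q_2^0$. It follows that

$$F = b_1 q_1^0 + b_2 q_2^0$$

and the first part of the theorem is proved.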
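It remains to show that

$$\mu_F^2 = \text{estimate of } \sigma_F^2 = \frac{S_0}{n-2}\sum_{i=1}^{n}\frac{\lambda_i^2}{P_i}.$$

From first principles we have shown that

$$\mathcal{E}(F-\theta)^2 = \sigma_F^2 = \sigma^2 \sum_{i=1}^{n} \frac{\lambda_i^2}{P_i},$$

and it will be sufficient therefore to show that

$$\mathcal{E}(S_0) = (n-2)\sigma^2.$$

$S_0$ is defined as

$$S_0 = \sum_{i=1}^{n} (x_i - a_{i1}q_1^0 - a_{i2}q_2^0)^2 P_i.$$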



This may be rearranged as 



Expanding the right-hand side, and taking expectations on both 
sides, while remembering that 



a straightforward but lengthy algebraic process reduces to

$$\mathcal{E}(S_0) = n\sigma^2 - \sigma^2 - \sigma^2 = (n-2)\sigma^2$$

and the second part of the theorem is proved.
APPLICATION OF MARKOFF THEOREM TO
LINEAR REGRESSION

A random variable $y$ is known to be correlated with another
variable $x$ and the regression of $y$ on $x$ may be assumed to be
linear. It may further be assumed that the standard error of $y$
for a given $x$ is constant, and does not depend on the value of $x$.
Values of $x$, $n$ in number, are selected systematically beforehand,
and the value of $y$ taken at any of these $x$'s is assumed to be
independent of the value of $y$ taken at any other of the $x$'s.

In this case the $y$'s take the place of the $x$'s of the theorem. The
conditions of the theorem are satisfied and it is required to find
the best linear unbiased estimate of

$$\theta = p_1 + p_2 X,$$

where $p_1$ and $p_2$ are unknown parameters and $X$ is the population
mean.

The statement that the regression of $y$ on $x$ is linear is
equivalent to writing

$$\mathcal{E}(y_i) = p_1 + p_2 x_i, \qquad i = 1, 2, \ldots, n.$$

Referring back to the proof of the theorem and noting that
$P_i = 1$, $b_1 = 1$, $b_2 = X$, $a_{i1} = 1$, $a_{i2} = x_i$, and that $y_i$ is the $x_i$ of
the theorem it will be seen that

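$$F = \frac{\sum x_i \sum x_i y_i - \sum x_i^2 \sum y_i}{\big(\sum x_i\big)^2 - n\sum x_i^2} + X\,\frac{\sum x_i \sum y_i - n\sum x_i y_i}{\big(\sum x_i\big)^2 - n\sum x_i^2},$$

or, if $\bar{x}$, $\bar{y}$, $s_x$, $s_y$, and $r_{xy}$ have their usual meanings, we may write

$$F = \bar{y} + r_{xy}\frac{s_y}{s_x}(X - \bar{x}),$$

which is the familiar formula for the regression of $y$ on $x$. The
(standard error)$^2$ of $F$ may be obtained in the same way.

$$\sum_{i=1}^{n} \lambda_i^2 = \frac{1}{n}\left[1 + \frac{(X-\bar{x})^2}{s_x^2}\right].$$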




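Substituting for $q_1^0$ and $q_2^0$ in $S$ it may be shown that

$$S_0 = n s_y^2 (1 - r_{xy}^2),$$

whence

$$\mu_F^2 = \frac{s_y^2 (1 - r_{xy}^2)}{n-2}\left[1 + \frac{(X-\bar{x})^2}{s_x^2}\right].$$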

It has been stated that the $x$'s are at choice. In order therefore
to make $\mu_F^2$ as small as possible two courses are open: first, the
mean of the sample chosen may be made exactly equal to $X$, or,
secondly, $s_x^2$ may be made as large as possible. The first course is
not often practicable. If all the $x$'s are chosen in a cluster about
$X$, then, while $X - \bar{x}$ will be small (it will rarely be possible to make
it exactly zero) so also will $s_x$, and the resulting ratio may be
large. It is possible, however, to carry out the second course by
choosing pairs of values at the same distance (approximately)
on either side of $X$ but as far away as possible. In this way $\bar{x}$ will
be close to $X$ but $s_x^2$ will be large and the ratio $(X - \bar{x})/s_x$ will
therefore be small.

It should be noted that no assumption whatever of normality 
has been made. 
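The regression estimate, its estimated (standard error)$^2$, and the effect of the choice of the $x$'s may be illustrated by a short computation. This is a sketch only: the function names and all the numerical values are assumptions made for illustration, and $s_x$, $s_y$ are taken with divisor $n$.

```python
def regression_estimate(xs, ys, X):
    """Return (F, mu2): the estimate of p1 + p2*X and its estimated
    (standard error)^2, with s_x and s_y computed with divisor n."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sx2 = sum((x - xbar) ** 2 for x in xs) / n
    sy2 = sum((y - ybar) ** 2 for y in ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n
    F = ybar + (sxy / sx2) * (X - xbar)        # r * s_y / s_x equals sxy / sx2
    r2 = sxy ** 2 / (sx2 * sy2)
    mu2 = sy2 * (1.0 - r2) / (n - 2) * (1.0 + (X - xbar) ** 2 / sx2)
    return F, mu2

def design_factor(xs, X):
    """The factor 1 + (X - xbar)^2 / s_x^2, which depends on the x's alone."""
    n = len(xs)
    xbar = sum(xs) / n
    sx2 = sum((x - xbar) ** 2 for x in xs) / n
    return 1.0 + (X - xbar) ** 2 / sx2

X = 5.0
clustered = [4.8, 4.9, 5.1, 5.2, 5.3, 5.4]     # all close to X
spread = [1.0, 1.0, 1.0, 9.0, 9.0, 9.2]        # balanced about X, far away
print(design_factor(clustered, X))             # noticeably above 1
print(design_factor(spread, X))                # very close to 1

ys = [10.2, 9.7, 10.4, 17.9, 18.3, 18.1]       # hypothetical observations
print(regression_estimate(spread, ys, X))
```

The clustered design leaves $(X-\bar{x})^2/s_x^2$ appreciable even though $\bar{x}$ is near $X$, whereas the widely spread pairs make it negligible.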



SIMPLIFICATION OF THE CALCULATIONS BY 
MEANS OF DETERMINANTS 

For the case of two parameters the algebra involved in the 
application of Markoff's theorem is not heavy, but it rapidly 
becomes so when the number of parameters increases. Since it 
is desired to discuss applications of the theorem for the case of 
three or more parameters the solutions of the equations in the 
form of determinants are set out below. These determinants 
reduce the application of the theorem to an arithmetical process. 
The reader may check the truth of the statements from the 
proof of the theorem for two parameters. 

We refer to the notation given in the statement of the 
generalization of the Markoff theorem. 

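If

$$G_0 = \sum_{i=1}^{n} P_i x_i^2, \qquad H_j = \sum_{i=1}^{n} P_i x_i a_{ij} \quad\text{for } j = 1, 2, \ldots, s,$$

and

$$G_{jk} = \sum_{i=1}^{n} P_i a_{ij} a_{ik} \quad\text{for } j = 1, 2, \ldots, s, \text{ and } k = 1, 2, \ldots, s,$$

then

$$F = -\Delta_0/\Delta,$$

where

$$\Delta_0 = \begin{vmatrix} 0 & b_1 & b_2 & \cdots & b_s \\ H_1 & G_{11} & G_{12} & \cdots & G_{1s} \\ H_2 & G_{21} & G_{22} & \cdots & G_{2s} \\ \vdots & \vdots & \vdots & & \vdots \\ H_s & G_{s1} & G_{s2} & \cdots & G_{ss} \end{vmatrix}
\quad\text{and}\quad
\Delta = \begin{vmatrix} G_{11} & G_{12} & \cdots & G_{1s} \\ G_{21} & G_{22} & \cdots & G_{2s} \\ \vdots & \vdots & & \vdots \\ G_{s1} & G_{s2} & \cdots & G_{ss} \end{vmatrix}.$$

Also

$$\mu_F^2 = -\frac{\Delta_1 \Delta_2}{(n-s)\,\Delta^2},$$

where $\Delta$ is as already defined, and $\Delta_1$ and $\Delta_2$ are

$$\Delta_1 = \begin{vmatrix} 0 & b_1 & b_2 & \cdots & b_s \\ b_1 & G_{11} & G_{12} & \cdots & G_{1s} \\ b_2 & G_{21} & G_{22} & \cdots & G_{2s} \\ \vdots & \vdots & \vdots & & \vdots \\ b_s & G_{s1} & G_{s2} & \cdots & G_{ss} \end{vmatrix}
\quad\text{and}\quad
\Delta_2 = \begin{vmatrix} G_0 & H_1 & H_2 & \cdots & H_s \\ H_1 & G_{11} & G_{12} & \cdots & G_{1s} \\ H_2 & G_{21} & G_{22} & \cdots & G_{2s} \\ \vdots & \vdots & \vdots & & \vdots \\ H_s & G_{s1} & G_{s2} & \cdots & G_{ss} \end{vmatrix}.$$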
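The determinantal form lends itself to direct computation. The sketch below is illustrative only: it assumes the bordered determinants take the usual form (the matrix of sums $\sum P_i a_{ij} a_{ik}$ bordered by the $b$'s, by the sums $\sum P_i x_i a_{ij}$, and by $\sum P_i x_i^2$), and the names `D0`, `D1`, `D2` and the small data set are our own. The determinant is expanded by cofactors, which is adequate for the small orders met in practice.

```python
def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, v in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * v * det(minor)
    return total

def markoff_by_determinants(x, a, P, b):
    """Return (F, S0, mu2) for E(x_i) = sum_j a[i][j] * p_j and weights P."""
    n, s = len(x), len(b)
    G0 = sum(P[i] * x[i] ** 2 for i in range(n))
    H = [sum(P[i] * x[i] * a[i][j] for i in range(n)) for j in range(s)]
    G = [[sum(P[i] * a[i][j] * a[i][k] for i in range(n)) for k in range(s)]
         for j in range(s)]
    D = det(G)
    D0 = det([[0.0] + list(b)] + [[H[j]] + G[j] for j in range(s)])
    D1 = det([[0.0] + list(b)] + [[b[j]] + G[j] for j in range(s)])
    D2 = det([[G0] + H] + [[H[j]] + G[j] for j in range(s)])
    F = -D0 / D                       # estimate of b1*p1 + ... + bs*ps
    S0 = D2 / D                       # minimum of the sum of squares
    mu2 = -D1 * D2 / ((n - s) * D * D)
    return F, S0, mu2

# Illustrative data: straight-line regression, estimate at the point 1.5.
t = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 2.0, 4.0]
a = [[1.0, ti] for ti in t]
F, S0, mu2 = markoff_by_determinants(y, a, [1.0] * 4, [1.0, 1.5])
print(F, S0, mu2)
```

For these figures the point 1.5 is the mean of the $t$'s, so that $F$ is simply the mean of the $y$'s, which gives a ready check on the arithmetic.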



APPLICATION OF MARKOFF THEOREM TO 
PARTIAL REGRESSION 

We may apply Markoff's theorem to deduce formulae for the 
estimates and the variances of the estimates of 

(i) a partial regression coefficient in an equation with two 
independent variables, e.g. z = A + Bx+ Cy, 

(ii) the ordinate $z$ of the regression plane corresponding to
$x = \xi$, and $y = \eta$.

It is necessary to begin by deciding in Markoff terminology what
it is that it is desired to estimate.
If the equation

$$z = A + Bx + Cy$$

is considered then it would appear that for the two different
cases

(i) $\theta = A$ or $B$ or $C$, (ii) $\theta = A + B\xi + C\eta$.

It will be necessary to discuss (ii) only because (i) will be solved
automatically in the process of the solution for (ii). All the
conditions of the Markoff theorem can be made to be satisfied
with the exception of the restriction

$$\sigma_i^2 = \sigma^2/P_i.$$

No indication is given in the statement of the problem as to the
nature of $\sigma_i^2$, and therefore of $P_i$, and we must perforce
put $P_i = 1$ for all $i$. In so doing we shall not invalidate the
unbiasedness of the estimate but we can no longer speak of it as
the 'best' estimate. If

$$\theta = A + B\xi + C\eta$$

then, in the terminology of the theorem, $b_1 = 1$, $b_2 = \xi$, $b_3 = \eta$.
We have now to consider the expectation of $z$.

We imagine that we have $n$ sets of observations $x_i$, $y_i$, $z_i$, with

$$\mathcal{E}(z_i) = A + Bx_i + Cy_i \quad\text{for } i = 1, 2, \ldots, n.$$

All the quantities necessary to evaluate the determinants of
the previous section are available.



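$$H_1 = \sum z_i, \qquad H_2 = \sum z_i x_i, \qquad H_3 = \sum z_i y_i,$$

$$G_{11} = n, \quad G_{12} = \sum x_i, \quad G_{13} = \sum y_i, \quad G_{22} = \sum x_i^2, \quad G_{23} = \sum x_i y_i, \quad G_{33} = \sum y_i^2.$$

The determinants for $F$ will be

$$\Delta_0 = \begin{vmatrix} 0 & 1 & \xi & \eta \\ \sum z_i & n & \sum x_i & \sum y_i \\ \sum z_i x_i & \sum x_i & \sum x_i^2 & \sum x_i y_i \\ \sum z_i y_i & \sum y_i & \sum x_i y_i & \sum y_i^2 \end{vmatrix}$$

and

$$\Delta = \begin{vmatrix} n & \sum x_i & \sum y_i \\ \sum x_i & \sum x_i^2 & \sum x_i y_i \\ \sum y_i & \sum x_i y_i & \sum y_i^2 \end{vmatrix}.$$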
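The summation in each case is understood to extend from
$i = 1$ to $i = n$. Replace these variables by a new set of variables
$X_i$, $Y_i$ and $Z_i$, such that

$$X_i = x_i - \bar{x}, \qquad Y_i = y_i - \bar{y}, \qquad Z_i = z_i - \bar{z},$$

and denote the standard deviations of $X$, $Y$, and $Z$ by $s_X$, $s_Y$, $s_Z$, and their
correlation coefficients by $r_{XY}$, $r_{YZ}$, $r_{XZ}$.

The determinants become, writing $\xi'$ for $\xi - \bar{x}$, and $\eta'$ for $\eta - \bar{y}$,

$$\Delta_0 = \begin{vmatrix} 0 & 1 & \xi' & \eta' \\ 0 & n & 0 & 0 \\ n r_{XZ}s_Xs_Z & 0 & n s_X^2 & n r_{XY}s_Xs_Y \\ n r_{YZ}s_Ys_Z & 0 & n r_{XY}s_Xs_Y & n s_Y^2 \end{vmatrix}$$

and

$$\Delta = \begin{vmatrix} n & 0 & 0 \\ 0 & n s_X^2 & n r_{XY}s_Xs_Y \\ 0 & n r_{XY}s_Xs_Y & n s_Y^2 \end{vmatrix}.$$

Easy algebra will give that

$$F = \bar{z} + \frac{s_Z}{1 - r_{XY}^2}\left[(r_{XZ} - r_{XY}r_{YZ})\frac{\xi - \bar{x}}{s_X} + (r_{YZ} - r_{XY}r_{XZ})\frac{\eta - \bar{y}}{s_Y}\right].$$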



It remains to estimate $A$, $B$ and $C$. The determinant $\Delta$ will be the
same as before for each parameter, but $\Delta_0$ will be different.
For $A$, the first row of $\Delta_0$ will be

$$0 \quad 1 \quad 0 \quad 0,$$

and the determinant, in terms of the variables $X$, $Y$, and $Z$, is zero. It follows
that $A$ is the constant involving the means in the general
expression already calculated above and that the estimate of $A$ is
therefore

$$\text{estimate of } A = \bar{z} - \frac{s_Z}{1 - r_{XY}^2}\left[(r_{XZ} - r_{XY}r_{YZ})\frac{\bar{x}}{s_X} + (r_{YZ} - r_{XY}r_{XZ})\frac{\bar{y}}{s_Y}\right].$$

The reader may prove this by working out the determinant
in terms of the original variables as given above.

For $B$, the first row of $\Delta_0$ will be

$$0 \quad 0 \quad 1 \quad 0,$$

from which the estimate of $B$ is found to be

$$\text{estimate of } B = \frac{s_Z}{s_X}\cdot\frac{r_{XZ} - r_{XY}r_{YZ}}{1 - r_{XY}^2}.$$

Similarly for $C$ we have

$$\text{estimate of } C = \frac{s_Z}{s_Y}\cdot\frac{r_{YZ} - r_{XY}r_{XZ}}{1 - r_{XY}^2}.$$



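We now proceed to the evaluation of the variance of the
estimate of

$$\theta = A + B\xi + C\eta.$$

The variance, $\mu_F^2$, the estimate of $\sigma_F^2$, is

$$\mu_F^2 = -\frac{\Delta_1\Delta_2}{(n-3)\,\Delta^2},$$

where $\Delta_1$ and $\Delta_2$ have been defined and $\Delta$ already calculated in
finding $F$. In the terminology of the present problem

$$\Delta_1 = \begin{vmatrix} 0 & 1 & \xi & \eta \\ 1 & n & \sum x_i & \sum y_i \\ \xi & \sum x_i & \sum x_i^2 & \sum x_i y_i \\ \eta & \sum y_i & \sum x_i y_i & \sum y_i^2 \end{vmatrix}$$

and

$$\Delta_2 = \begin{vmatrix} \sum z_i^2 & \sum z_i & \sum z_i x_i & \sum z_i y_i \\ \sum z_i & n & \sum x_i & \sum y_i \\ \sum z_i x_i & \sum x_i & \sum x_i^2 & \sum x_i y_i \\ \sum z_i y_i & \sum y_i & \sum x_i y_i & \sum y_i^2 \end{vmatrix}.$$

These determinants may be simplified as before by transforming
the variables, and $\mu_F^2$ is found to be

$$\mu_F^2 = \frac{s_z^2}{n-3}\cdot\frac{1 - r_{xy}^2 - r_{yz}^2 - r_{zx}^2 + 2r_{xy}r_{yz}r_{zx}}{1 - r_{xy}^2}\left[1 + \frac{1}{1 - r_{xy}^2}\left(\frac{(\xi-\bar{x})^2}{s_x^2} - \frac{2r_{xy}(\xi-\bar{x})(\eta-\bar{y})}{s_x s_y} + \frac{(\eta-\bar{y})^2}{s_y^2}\right)\right].$$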
It may be noted here, as in the case of the regression line, that if
$x$ and $y$ are at choice, the estimate of $z$ with a small standard
error will be obtained by making the standard errors of $x$ and $y$
as large as possible, but in so doing taking care to balance the
values of the observations about $\xi$ and $\eta$ so that $\xi - \bar{x}$ and $\eta - \bar{y}$
are as small as possible.
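The estimates of $A$, $B$, $C$ and of the ordinate of the plane can be computed directly from the standard deviations and correlation coefficients. The following sketch uses the usual partial-regression formulae with $P_i = 1$; the function names and the small data set (which lies exactly on a plane, so that the answer is known beforehand) are assumptions made for illustration.

```python
from math import sqrt

def _mean(v):
    return sum(v) / len(v)

def _sd(v):
    m = _mean(v)
    return sqrt(sum((a - m) ** 2 for a in v) / len(v))

def _corr(u, v):
    mu, mv = _mean(u), _mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)
    return cov / (_sd(u) * _sd(v))

def plane_coefficients(xs, ys, zs):
    """Estimates of A, B, C in z = A + B*x + C*y from s.d.'s and correlations."""
    sx, sy, sz = _sd(xs), _sd(ys), _sd(zs)
    rxy, rxz, ryz = _corr(xs, ys), _corr(xs, zs), _corr(ys, zs)
    B = (sz / sx) * (rxz - rxy * ryz) / (1.0 - rxy ** 2)
    C = (sz / sy) * (ryz - rxy * rxz) / (1.0 - rxy ** 2)
    A = _mean(zs) - B * _mean(xs) - C * _mean(ys)
    return A, B, C

def plane_ordinate(xs, ys, zs, xi, eta):
    """Estimate of the ordinate of the regression plane at x = xi, y = eta."""
    A, B, C = plane_coefficients(xs, ys, zs)
    return A + B * xi + C * eta

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 0.0, 2.0, 1.0, 3.0]
zs = [1.0 + 2.0 * x - 3.0 * y for x, y in zip(xs, ys)]   # exact plane
print(plane_coefficients(xs, ys, zs))
print(plane_ordinate(xs, ys, zs, 2.0, 1.0))
```

Since the illustrative observations lie exactly on $z = 1 + 2x - 3y$, the computed coefficients should reproduce 1, 2 and $-3$ apart from rounding.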

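It is left to the reader as an exercise to calculate the estimates
of the variances of $A$, $B$, and $C$. By evaluating the appropriate
determinants these will be found to be

$$\mu_A^2 = \frac{s_z^2}{n-3}\cdot\frac{1 - r_{xy}^2 - r_{yz}^2 - r_{zx}^2 + 2r_{xy}r_{yz}r_{zx}}{1 - r_{xy}^2}\left[1 + \frac{1}{1 - r_{xy}^2}\left(\frac{\bar{x}^2}{s_x^2} - \frac{2r_{xy}\bar{x}\bar{y}}{s_x s_y} + \frac{\bar{y}^2}{s_y^2}\right)\right],$$

$$\mu_B^2 = \frac{s_z^2}{n-3}\cdot\frac{1 - r_{xy}^2 - r_{yz}^2 - r_{zx}^2 + 2r_{xy}r_{yz}r_{zx}}{s_x^2\,(1 - r_{xy}^2)^2},$$

$$\mu_C^2 = \frac{s_z^2}{n-3}\cdot\frac{1 - r_{xy}^2 - r_{yz}^2 - r_{zx}^2 + 2r_{xy}r_{yz}r_{zx}}{s_y^2\,(1 - r_{xy}^2)^2}.$$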



Numerical Application 

An example of the numerical application of the regression-line 
formulae will be found in the following problem. 

Table I gives the distribution of yield estimates of sugar beet 
per acre on 100 fields of 50 acres each. The estimates are made by 
eye some time before the harvest and are expressed in terms of 
conventional marks varying from 1 to 10. Table II represents the 
experience with similar marking in the previous year, the 
markings x being correlated with the actual yields y in tons per 
acre obtained on harvesting. 

It may be assumed that the regression of y on x is linear and 
that the y arrays are homoscedastic. Use the data given in the 
tables to calculate the best linear unbiased estimate, F, of the 
total yield Y on an area of 5000 acres with estimated yields as 
in Table I, and find the standard error of Y. (M.Sc. London, 
1937.) 

It is clear that this is a case for the application of Markoff's
theorem. It is given that

(1) the regression of $y$ on $x$ is linear, i.e. $\mathcal{E}(y) = p_1 + p_2 x$,

(2) the standard deviation of $y$ in arrays is constant, i.e.
$P_i = 1$ for $i = 1, 2, \ldots, 10$.
Estimate of yield per acre in marks    Number of fields
                 1                             1
                 2                             3
                 3                             7
                 4                             5
                 5                            10
                 6                            19
                 7                            30
                 8                            10
                 9                             7
                10                             8
               Total                         100



TABLE II. Correlation between estimated yield x and actual yield y 
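Assume that there are $n_1$ acres with mark 1, $n_2$ acres with
mark 2, ..., $n_{10}$ acres with mark 10.

It follows that

$$\theta = \sum_{x=1}^{10} n_x (p_1 + p_2 x).$$

From previous work with the regression line we may at once
write down

$$F = \sum_{x=1}^{10} n_x\left[\bar{y} + r_{xy}\frac{s_y}{s_x}(x - \bar{x})\right]$$

and the corresponding expression for $\mu_F^2$.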
From the data of Table II

$$F = \sum_{x=1}^{10} n_x\,[10.788 + 1.372(x - 4.971)].$$

$n_x$ are the figures on the right-hand side of Table I multiplied
by 50. Hence giving $x$ values in turn 1, 2, ..., 10 and multiplying
by $n_x$ it is found that

$$F = 64{,}089 \text{ tons}.$$
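The arithmetic may be reproduced as follows. This is a sketch under two stated assumptions: the field counts are taken to total 100, as the total row of Table I requires (so that mark 4 is credited with 5 fields), and the regression constants are used to the three decimals printed above. The variable names are illustrative.

```python
# Number of fields at each mark (Table I), taken to total 100.
fields = {1: 1, 2: 3, 3: 7, 4: 5, 5: 10, 6: 19, 7: 30, 8: 10, 9: 7, 10: 8}
ACRES_PER_FIELD = 50

def predicted_yield(mark):
    """Regression estimate of yield in tons per acre for a given mark."""
    return 10.788 + 1.372 * (mark - 4.971)

F = sum(ACRES_PER_FIELD * count * predicted_yield(mark)
        for mark, count in fields.items())
print(round(F))   # close to the 64,089 tons quoted in the text
```

With the rounded constants the total comes to about 64,086 tons; the small difference from the quoted 64,089 is presumably due to the rounding of the regression constants.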

It is left to the reader to find the numerical value o 
CHAPTER XIV

FURTHER APPLICATIONS OF MARKOFF'S 
METHOD. SAMPLING HUMAN POPULATIONS 

Every ten years in normal peacetime conditions it is the custom 
in the United Kingdom to carry out a complete enumeration of 
persons in the British Isles. In addition to the counting of heads, 
various questions are asked such as the age and sex of the 
individual and so on. The information thus obtained is tabulated 
and collated and provides information regarding the distribution 
of individuals which is vital for the governing of the country. 
Occasionally, however, even such complete enumerations as this 
may give rise to misleading conclusions. A complete census of 
individuals was taken in September 1939 after war had been 
declared and large-scale evacuation had taken place. During late 
1939 and early 1940 there was a steady drift back to the towns 
and many men were called to the armed forces; as a consequence 
when air raids began in July 1940 the statistician and adminis- 
trator had no really precise idea of the size of the populations 
which would be, and were, exposed to risk in the large towns. 

It is not suggested that the peacetime census figures will be 
subject to such vast fluctuations as those of the National 
Register of 1939. Nevertheless, ten years is a long time and under 
the press of modern conditions it may well be that the legislator 
will need to supplement the ten-yearly figures by a sample census
taken at more frequent intervals. An example of this may be 
found in the sampling scheme carried out in 1946 to obtain 
information regarding the size of families. 

There are other arguments which may be put forward in favour 
of conducting sample enumerations. If the country is to have 
a planned economy then it will be necessary to find out what 
is the minimum consumption by individuals of certain goods. 
Sampling surveys to this end were carried out during the war by 
several Government Departments such as the Board of Trade 
and the Ministry of Food. Only in this way will it be possible in
a time of scarcity to ensure that no one goes without and that
there is no surplus, and therefore no waste of manpower.
Thus it would seem that the process of sampling human
populations, well known to statisticians long before the war, is likely
to be used frequently by legislators and planners.

It is customary to begin by fixing the size of the sample to 
be collected, and this is decided most often not on statistical 
principles but by the amount of money available to be spent on 
the collection of the material and the urgency with which the 
answer is required. A large sample will be costly both in money 
and in the time necessary to analyse the results. The statistician 
may state therefore what would seem to be a minimum figure, 
but the actual determination of the size of sample will not be 
entirely in his hands. 

Suppose it is decided to collect a sample of n out of a total of 
N where both n and N are likely to be large numbers. For 
example, if N is composed of units which are industrial firms then 
N may be of the order of 50,000 and n of the order of 1000. The 
ratio of n/N will therefore be 1/50. The question now arises as 
to how the sample should be selected. The obvious way would 
seem to be to select n firms at random from the N firms, or if the 
list of N firms is in random order to choose every fiftieth firm for 
investigation. This procedure is sound enough, for the n firms 
would give an estimate of the population of N firms which would 
be unbiased. It would, however, in many cases lead to rather a 
large standard error of the estimate. 

Consider, for example, firms employed in building houses. The 
number of firms which employ only one man runs into thousands, 
while the number of firms which employ thousands of men is very 
small. If it is desired to find out something about the output per 
man-hour it would be misleading to sample the population of 
firms at random. The large firms are few in number and would
have little chance of being included in the sample; conversely, if 
they were included they might upset the estimate from a small 
sample radically. It is clear therefore that before sampling takes 
place the material must be divided into groups, commonly 
denoted as strata, such that the material within each of these 
strata is as far as possible homogeneous. For instance, in the case
of the building firms the obvious method of dichotomy would
be to take the first stratum of firms employing one building
operative only, and so on. If there were obvious differences
between such firms then it might be necessary to divide the strata
into substrata, but the method of stratification will usually be
clear.

If we have a population $N$ divided into $k$ strata containing
$M_1, M_2, \ldots, M_k$ individuals such that

$$\sum_{i=1}^{k} M_i = N,$$

then the intuitive choice would be to take numbers from each
stratum proportional to its size, subject to the restriction that
the total sample size is to be $n$. Thus if $m_1, m_2, \ldots, m_k$ are the
sub-samples chosen from the $k$ strata then

$$\sum_{i=1}^{k} m_i = n \quad\text{and}\quad m_i = \frac{M_i}{N}\,n.$$

There is, however, one further point which may be considered. 
The variation of individuals within some strata may be less than 
the variation in others. Consequently a smaller sub-sample need
be taken from the strata in which the variation is small, and a 
larger sub-sample from the strata in which the variation is large. 
The methods of the Markoff theorem may be used to show that 
the choice of the number of individuals proportional to the 
number in the stratum provides an unbiased estimate of the 
collective character in the population, but that if the variance 
within the stratum is known then it is not the best estimate which 
can be made. 

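Suppose a population $\pi$ is divided into $k$ strata, $\pi_1, \pi_2, \ldots, \pi_k$,
and that the $i$th stratum $\pi_i$ of the population $\pi$ contains $M_i$
individuals ($i = 1, 2, \ldots, k$).

Let $u_{ij}$ be the measured character of the $j$th individual of the
$i$th stratum, let

$$\bar{u}_i = \frac{1}{M_i}\sum_{j=1}^{M_i} u_{ij},$$

and let $\sigma_i^2$ be

$$\sigma_i^2 = \frac{1}{M_i}\sum_{j=1}^{M_i} (u_{ij} - \bar{u}_i)^2.$$

It is required to estimate

$$\theta = \sum_{i=1}^{k}\sum_{j=1}^{M_i} u_{ij} = \sum_{i=1}^{k} M_i \bar{u}_i.$$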
Let $m_i$ be the size of the sub-sample selected from the stratum
$\pi_i$, and let $x_{ij}$ be the value of the $j$th element of this sub-sample.
Further, let the best linear unbiased estimate of $\theta$ be $F$, where $F$
is a function

$$F = \sum_{i=1}^{k}\sum_{j=1}^{m_i} \lambda_{ij} x_{ij}$$

and the $\lambda$'s are constants to be determined.
It is clear that

$$\mathcal{E}(x_{ij}) = \bar{u}_i,$$

whence, remembering that the expectation of $F$ must be
identically equal to $\theta$, we have

$$\sum_{i=1}^{k} \bar{u}_i\Big(\sum_{j=1}^{m_i}\lambda_{ij} - M_i\Big) \equiv 0.$$

$\bar{u}_i$ is constant for any given stratum and it follows therefore that
the $\lambda_{ij}$ must satisfy the condition

$$\sum_{j=1}^{m_i} \lambda_{ij} = M_i$$

for all $i$. The (standard error)$^2$ of $F$ follows directly from definition,

$$\sigma_F^2 = \mathcal{E}(F - \theta)^2.$$



In any given stratum the sampling will be without replacement. 
Write for convenience 



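By definition

$$\sigma_F^2 = \mathcal{E}\Big[\sum_{i=1}^{k}\sum_{j=1}^{m_i}\lambda_{ij}(x_{ij}-\bar{u}_i)\Big]^2$$

$$= \mathcal{E}\Big[\sum_{i=1}^{k}\sum_{j=1}^{m_i}\lambda_{ij}^2(x_{ij}-\bar{u}_i)^2 + \sum_{i=1}^{k}\sum_{j=1}^{m_i}\sum_{t\ne j}\lambda_{ij}\lambda_{it}(x_{ij}-\bar{u}_i)(x_{it}-\bar{u}_i) + \sum_{i=1}^{k}\sum_{s\ne i}\sum_{j=1}^{m_i}\sum_{t=1}^{m_s}\lambda_{ij}\lambda_{st}(x_{ij}-\bar{u}_i)(x_{st}-\bar{u}_s)\Big].$$

The first part of this expression is immediately evaluated.

$$\mathcal{E}\sum_{i=1}^{k}\sum_{j=1}^{m_i}\lambda_{ij}^2(x_{ij}-\bar{u}_i)^2 = \sum_{i=1}^{k}\sum_{j=1}^{m_i}\lambda_{ij}^2\sigma_i^2.$$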

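The second term involving the cross-products is more difficult
but only in so far as the enumeration of the individual terms is
concerned. The third term is zero for obviously $x_{ij}$ is independent
of $x_{st}$ for all $i \ne s$, for the drawing of an individual from one
stratum cannot affect the chances of an individual being drawn
from another. It will follow that

$$\mathcal{E}\sum_{j=1}^{m_i}\sum_{t=1}^{m_s}\lambda_{ij}\lambda_{st}(x_{ij}-\bar{u}_i)(x_{st}-\bar{u}_s) = 0.$$

The second term will not be zero. The drawing of one individual
from the $i$th stratum will affect the chances of another individual
from the same stratum of being drawn and therefore the drawings
of elements from the $i$th stratum cannot be considered as
independent. Consider the expectation of a typical term of the
summation, say

$$\mathcal{E}\big(\lambda_{ij}\lambda_{it}(x_{ij}-\bar{u}_i)(x_{it}-\bar{u}_i)\big).$$

By an appeal to first principles it is clear that

$$\mathcal{E}\big((x_{ij}-\bar{u}_i)(x_{it}-\bar{u}_i)\big) = \frac{2!\,(M_i-2)!}{M_i!}\sum_{j=1}^{M_i-1}\sum_{t=j+1}^{M_i}(u_{ij}-\bar{u}_i)(u_{it}-\bar{u}_i).$$

This may be simplified by the device used in Chapter XI and we
have

$$\Big[\sum_{j=1}^{M_i}(u_{ij}-\bar{u}_i)\Big]^2 = 0 = \sum_{j=1}^{M_i}(u_{ij}-\bar{u}_i)^2 + 2\sum_{j=1}^{M_i-1}\sum_{t=j+1}^{M_i}(u_{ij}-\bar{u}_i)(u_{it}-\bar{u}_i),$$

from which it is easy to see that

$$2\sum_{j=1}^{M_i-1}\sum_{t=j+1}^{M_i}(u_{ij}-\bar{u}_i)(u_{it}-\bar{u}_i) = -M_i\sigma_i^2,$$

and that $\mathcal{E}\big(\lambda_{ij}\lambda_{it}(x_{ij}-\bar{u}_i)(x_{it}-\bar{u}_i)\big) = -\lambda_{ij}\lambda_{it}\,\dfrac{\sigma_i^2}{M_i-1}$.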
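The negative covariance between two drawings from the same stratum can be confirmed by complete enumeration of a small stratum. The sketch below is illustrative only; the five values are arbitrary and the function names are our own. Every ordered pair of distinct individuals is equally likely when sampling is without replacement.

```python
from itertools import permutations

def population_variance(u):
    """sigma^2 with divisor M, as defined for a stratum."""
    mean = sum(u) / len(u)
    return sum((v - mean) ** 2 for v in u) / len(u)

def pair_covariance(u):
    """Expectation of (x_j - mean)(x_t - mean) for two distinct drawings
    made without replacement, found by enumerating all ordered pairs."""
    mean = sum(u) / len(u)
    pairs = list(permutations(u, 2))
    return sum((a - mean) * (b - mean) for a, b in pairs) / len(pairs)

stratum = [3.0, 7.0, 8.0, 12.0, 15.0]      # arbitrary illustrative values
print(pair_covariance(stratum))
print(-population_variance(stratum) / (len(stratum) - 1))   # the same value
```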
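Using this relation, we have for the expression of $\sigma_F^2$,

$$\sigma_F^2 = \sum_{i=1}^{k}\sigma_i^2\sum_{j=1}^{m_i}\lambda_{ij}^2 - \sum_{i=1}^{k}\sum_{j=1}^{m_i}\sum_{t\ne j}\lambda_{ij}\lambda_{it}\,\frac{\sigma_i^2}{M_i-1}.$$

Making use of the fact that

$$\Big(\sum_{j=1}^{m_i}\lambda_{ij}\Big)^2 = \sum_{j=1}^{m_i}\lambda_{ij}^2 + \sum_{j=1}^{m_i}\sum_{t\ne j}\lambda_{ij}\lambda_{it},$$

the summation signs of the second term on the right-hand side
can be reduced to two and $\sigma_F^2$ rewritten

$$\sigma_F^2 = \sum_{i=1}^{k}\sigma_i^2\,\frac{M_i}{M_i-1}\sum_{j=1}^{m_i}\lambda_{ij}^2 - \sum_{i=1}^{k}\frac{\sigma_i^2}{M_i-1}\Big(\sum_{j=1}^{m_i}\lambda_{ij}\Big)^2.$$

We now introduce $\bar{\lambda}_i = \frac{1}{m_i}\sum_{j=1}^{m_i}\lambda_{ij}$. It is easy to show that

$$\sum_{j=1}^{m_i}(\lambda_{ij}-\bar{\lambda}_i)^2 = \sum_{j=1}^{m_i}\lambda_{ij}^2 - \frac{1}{m_i}\Big(\sum_{j=1}^{m_i}\lambda_{ij}\Big)^2,$$

whence on substitution into the expression for $\sigma_F^2$ we obtain
finally

$$\sigma_F^2 = \sum_{i=1}^{k}\frac{M_i\sigma_i^2}{M_i-1}\sum_{j=1}^{m_i}(\lambda_{ij}-\bar{\lambda}_i)^2 + \sum_{i=1}^{k}\frac{\sigma_i^2}{M_i-1}\cdot\frac{M_i-m_i}{m_i}\Big(\sum_{j=1}^{m_i}\lambda_{ij}\Big)^2.$$

It is clear that the $\lambda$'s which will satisfy the given condition and
also minimize $\sigma_F^2$ will be

$$\lambda_{ij} = \frac{M_i}{m_i}.$$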



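Substituting in these values for $F$ and $\sigma_F^2$, if

$$S_i^2 = \frac{M_i}{M_i-1}\,\sigma_i^2,$$

then

$$F = \sum_{i=1}^{k}\frac{M_i}{m_i}\sum_{j=1}^{m_i}x_{ij}, \qquad \sigma_F^2 = \sum_{i=1}^{k}\frac{M_i(M_i-m_i)}{m_i}\,S_i^2.$$

The variance of the estimate $F$ will therefore depend on the size
of the sub-sample taken from each stratum, for $\sigma_i$ and $M_i$ are
fixed numbers descriptive of the population. If the whole
population were enumerated then there would be no estimation
involved, the exact value of $F$ would be $\theta$ and the variance of $F$
would be zero.

The size of the population is fixed and is equal to

$$N = \sum_{i=1}^{k} M_i.$$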



We have said it is customary to fix before sampling the
approximate size of the sample to be drawn from the population. Let $n$
therefore be a fixed number where

$$n = \sum_{i=1}^{k} m_i.$$

Let

$$\bar{S} = \frac{1}{N}\sum_{i=1}^{k} M_i S_i$$

and rewrite $\sigma_F^2$, somewhat arbitrarily, in the following way:

$$\sigma_F^2 = \frac{N-n}{n}\sum_{i=1}^{k} M_i S_i^2 - \frac{N}{n}\sum_{i=1}^{k} M_i (S_i-\bar{S})^2 + \sum_{i=1}^{k} m_i\Big(\frac{M_iS_i}{m_i} - \frac{N\bar{S}}{n}\Big)^2.$$

It may be verified algebraically that this reduces to the expression
for $\sigma_F^2$ already found. If the size of the sub-sample drawn from
the stratum is proportional to the total number within the
stratum, i.e. if $m_i$ is proportional to $M_i$, then

$$\sigma_F^2 = \frac{N-n}{n}\sum_{i=1}^{k} M_i S_i^2.$$

If $m_i$ is proportional to $M_iS_i$, then

$$\sigma_F^2 = \frac{N-n}{n}\sum_{i=1}^{k} M_iS_i^2 - \frac{N}{n}\sum_{i=1}^{k} M_i(S_i-\bar{S})^2.$$

This implies that where the size of the sample is fixed a smaller
value will be obtained for the variance of $F$ if the numbers chosen
for the sample from each stratum are proportional to the number
within the stratum and to the standard error. Neyman has
pointed out that, while the standard error within each stratum
is not known generally, improvement in the accuracy of the
estimate will be obtained if it is decided to begin sampling
proportionately to the number in the stratum only, and then,
by estimating the standard error within each stratum from the
first observations taken, to readjust the numbers of the
sub-sample so that they are proportional to $M_i S_i$.
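The two methods of allocation may be compared numerically. The sketch below assumes the variance of the estimated total is $\sum M_i (M_i - m_i) S_i^2 / m_i$; the three strata and their standard deviations are hypothetical, the function names are our own, and the allocations are left fractional so that the closed forms can be checked exactly.

```python
def variance_of_total(M, S, m):
    """sigma_F^2 = sum of M_i (M_i - m_i) S_i^2 / m_i over the strata."""
    return sum(Mi * (Mi - mi) * Si ** 2 / mi for Mi, Si, mi in zip(M, S, m))

def proportional_allocation(M, n):
    """Sub-sample sizes proportional to the stratum sizes M_i."""
    N = sum(M)
    return [n * Mi / N for Mi in M]

def neyman_allocation(M, S, n):
    """Sub-sample sizes proportional to M_i * S_i."""
    total = sum(Mi * Si for Mi, Si in zip(M, S))
    return [n * Mi * Si / total for Mi, Si in zip(M, S)]

M = [400, 300, 300]        # hypothetical stratum sizes
S = [2.0, 5.0, 10.0]       # hypothetical within-stratum standard deviations
n = 100
N = sum(M)
S_bar = sum(Mi * Si for Mi, Si in zip(M, S)) / N

var_prop = variance_of_total(M, S, proportional_allocation(M, n))
var_neyman = variance_of_total(M, S, neyman_allocation(M, S, n))

# Closed forms for the two allocations.
closed_prop = (N - n) / n * sum(Mi * Si ** 2 for Mi, Si in zip(M, S))
closed_neyman = closed_prop - N / n * sum(Mi * (Si - S_bar) ** 2
                                          for Mi, Si in zip(M, S))
print(var_prop, closed_prop)
print(var_neyman, closed_neyman)
```

The reduction in variance is the term $(N/n)\sum M_i (S_i - \bar{S})^2$, which vanishes only when all the strata have the same standard deviation.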
Example. In some countries annual surveys of the crops grown
on farms are made. For this purpose each year a stratified random
sample of 1000 farms might be taken to provide data whereby
yields may be estimated, etc. We shall suppose that the whole
population of farms is divided into strata according to their
acreage. It is desired to estimate the total acreage under wheat
and it may be assumed that the standard error of this acreage
does not vary to any marked extent within farms of a given
acreage from year to year. It will be legitimate therefore to use
the estimates of $S_i$ which have been calculated from previous
years. These are given in the following table. ($M_i$ in hundreds.)

Acreage code no.     S_i      M_i
       1             7.3      180
       2             1.3      100
       3             5.1      110
       4             2.1       70
       5             9.2       50
       6             9.5       20
       7             4.5       15
       8            10.7        3
       9             6.6        6
      10            11.2        2



Calculate the numbers which should be drawn from each stratum
assuming (1) $m_i \propto M_i$, (2) $m_i \propto M_i S_i$, and find the variance of
the estimate $F$ in each case.

The equations of the theory just set out will enable the
numerical calculations to be carried out without any further
algebra. It is given that

$$n = 1000, \qquad N = 55{,}600,$$

and hence the numbers to be drawn from each stratum will be:



Acreage code no.         1    2    3    4    5    6    7    8    9   10   Total
$m_i \propto M_i$      324  180  198  126   90   36   27    5   11    4    1001
$m_i \propto M_i S_i$  443   44  189   50  155   64   23   11   13    8    1000



The effect of sampling proportionately to $M_i S_i$ is, as might have
been expected, to increase the size of the sub-samples to be drawn
from those strata with large standard deviations.

$\sigma_F^2$ follows directly from the expression given or may be
calculated from

$$\sigma_F^2 = \sum_{i=1}^{k}\frac{M_i(M_i-m_i)}{m_i}\,S_i^2,$$

where $m_i$ ($i = 1, 2, \ldots, k$) are first the numbers proportional to
$M_i$ and second the numbers proportional to $M_i S_i$. It is left to the
student to carry through the arithmetical processes necessary
to show that $\sigma_F^2(m_i \propto M_i S_i) < \sigma_F^2(m_i \propto M_i)$.
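The arithmetic of this example is easily mechanised. The sketch below makes two assumptions: the stratum sizes are taken so as to total $N = 55{,}600$ as stated (that is, 50 hundreds in stratum 5), and each allocation is rounded to the nearest whole farm. The variable names are illustrative.

```python
S = [7.3, 1.3, 5.1, 2.1, 9.2, 9.5, 4.5, 10.7, 6.6, 11.2]
M = [18000, 10000, 11000, 7000, 5000, 2000, 1500, 300, 600, 200]
n = 1000
N = sum(M)

# (1) numbers proportional to M_i, (2) numbers proportional to M_i * S_i
prop = [round(n * Mi / N) for Mi in M]
weights = [Mi * Si for Mi, Si in zip(M, S)]
neyman = [round(n * w / sum(weights)) for w in weights]

def variance_of_total(M, S, m):
    """sigma_F^2 = sum of M_i (M_i - m_i) S_i^2 / m_i over the strata."""
    return sum(Mi * (Mi - mi) * Si ** 2 / mi for Mi, Si, mi in zip(M, S, m))

var_prop = variance_of_total(M, S, prop)
var_neyman = variance_of_total(M, S, neyman)
print(prop, sum(prop))
print(neyman, sum(neyman))
print(var_neyman < var_prop)
```

Rounding makes the proportional allocation total 1001 rather than 1000, as in the table.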

RESTRICTED STRATIFIED SAMPLING 

The theory of stratified sampling, which has just been set out, 
is well known and has been in use for some years. This method 
should be adopted in all cases where direct information is 
available about the desired character once the sampling element 
has been drawn. There are, however, cases where, even after the
sample has been drawn, information is not readily available or it 
is perhaps costly in time and money to obtain. For example, 
a great many people may be found on inquiry to have an 
objection to telling the investigator the amount of their weekly 
wages. If, therefore, these people have been drawn as part of the 
random sample, trouble is caused if they refuse information. This 
is a real difficulty and it is difficult to overcome. 

If information cannot be obtained about a character X without 
difficulty, it has been suggested that it should be sought about 
another character Y which it is reasonable to suppose is highly 
correlated with X. For example, if X is the amount of the weekly 
wage, then X may be correlated with Y, the rateable value of the 
house in which the individual lives. Thus we should stratify the 
population according to Y and then sample within each stratum 
to obtain information about X. The situation is, however, 
complicated in that the distribution of Y within a given popula- 
tion may not often be known. The proposed method consists 
therefore of drawing a large sample randomly from the popula- 
tion and stratifying it according to the values of Y. From this 
stratified sample a further sample is drawn from which informa- 
tion is sought about X. 

It is open to question whether this method is more efficient 
than the normal method of stratified sampling, but if the 
correlation between X and Y is close, then in some circumstances 
this proposed method may be of value. 
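
The two-stage procedure described above can be sketched numerically. The following is a minimal illustration only, assuming the usual weighting in which each stratum mean of X from the second sample is weighted by the proportion $r_i$ of the first sample falling in that stratum; the function name `double_sampling_estimate` and the toy data are hypothetical and are not taken from the text.

```python
from collections import defaultdict


def double_sampling_estimate(first_phase_strata, second_phase):
    """Estimate the population mean of X as sum_i r_i * xbar_i.

    first_phase_strata: stratum label (from Y) of each of the N individuals in S1.
    second_phase: (stratum_label, x_value) pairs for the n individuals in S2.
    """
    N = len(first_phase_strata)
    counts = defaultdict(int)          # n_i: first-sample count per stratum
    for label in first_phase_strata:
        counts[label] += 1

    sums = defaultdict(float)          # running sum of X per stratum
    sizes = defaultdict(int)           # m_i: second-sample count per stratum
    for label, x in second_phase:
        sums[label] += x
        sizes[label] += 1

    estimate = 0.0
    for label, n_i in counts.items():
        if sizes[label] == 0:
            raise ValueError(f"no second-sample observations in stratum {label!r}")
        r_i = n_i / N                  # estimate of the stratum proportion p_i
        xbar_i = sums[label] / sizes[label]
        estimate += r_i * xbar_i
    return estimate


# Toy data (assumed): 10 individuals stratified by Y, 4 of them questioned about X.
s1 = ["low"] * 6 + ["high"] * 4
s2 = [("low", 3.0), ("low", 5.0), ("high", 10.0), ("high", 12.0)]
estimate = double_sampling_estimate(s1, s2)
```

Here the weights come entirely from the cheap first sample, while the expensive character X is observed only on the smaller second sample.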

Let the proportions of Y in the strata forming the complete
population, $\pi$, be $p_1, p_2, \ldots, p_k$. If $\bar{X}_i$ denotes the population mean
of the X's for the ith stratum ($i = 1, 2, \ldots, k$), then we shall
assume that it is required to estimate

$$\bar{X} = \sum_{i=1}^{k} p_i \bar{X}_i.$$

Let the first sample to be drawn be $S_1$, consisting of N elements.
These elements or individuals are stratified with respect to Y.
If $n_i$ of these individuals fall in the ith stratum, then $r_i = n_i/N$
will be an estimate of $p_i$, where



Let the second sample, drawn from $S_1$, be $S_2$. Let $S_2$ consist of n
individuals of which $m_i$ are drawn from the ith stratum of $S_1$.

If $x_{ij}$ denotes the jth individual drawn from the ith stratum
then we shall write



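We require to estimate $\bar{X}$, so we next consider a linear function

$$F = \sum_{l=1}^{k} \sum_{i=1}^{k} \sum_{j=1}^{m_i} \lambda_{lij}\, r_l\, x_{ij},$$

the $\lambda$'s of which will be determined by the Markoff method.
Noting that $\mathcal{E}(F) = \bar{X}$, we have also

$$\sum_{i=1}^{k} p_i \bar{X}_i \equiv \sum_{l=1}^{k} \sum_{i=1}^{k} \sum_{j=1}^{m_i} \lambda_{lij}\, \mathcal{E}(r_l x_{ij}).$$

Since $S_1$ was drawn without any consideration of the values of $S_2$,
it follows that, even if $S_2$ is drawn from $S_1$, the variable $x_{ij}$ will
be independent of $r_l$. Hence

$$\sum_{l=1}^{k} \sum_{i=1}^{k} \sum_{j=1}^{m_i} \lambda_{lij}\, \mathcal{E}(r_l x_{ij}) = \sum_{l=1}^{k} \sum_{i=1}^{k} \sum_{j=1}^{m_i} \lambda_{lij}\, \mathcal{E}(r_l)\, \mathcal{E}(x_{ij})$$

$$= \sum_{l=1}^{k} \sum_{i=1}^{k} \sum_{j=1}^{m_i} \lambda_{lij}\, p_l \bar{X}_i$$

and it will follow that

$$\sum_{l=1}^{k} p_l \Bigl( \sum_{i=1}^{k} \bar{X}_i \sum_{j=1}^{m_i} \lambda_{lij} - \bar{X}_l \Bigr) \equiv 0.$$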
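The necessary and sufficient condition that this shall hold good
identically is that

$$\sum_{i=1}^{k} \bar{X}_i \sum_{j=1}^{m_i} \lambda_{lij} - \bar{X}_l \equiv 0 \quad\text{for } l = 1, 2, \ldots, k.$$

Rewrite this in the following way:

$$\sum_{i=1}^{l-1} \bar{X}_i \sum_{j=1}^{m_i} \lambda_{lij} + \bar{X}_l \Bigl( \sum_{j=1}^{m_l} \lambda_{llj} - 1 \Bigr) + \sum_{i=l+1}^{k} \bar{X}_i \sum_{j=1}^{m_i} \lambda_{lij} \equiv 0.$$

In order to satisfy the identity we have that

$$\sum_{j=1}^{m_i} \lambda_{lij} = 0 \quad (i \neq l;\ i = 1, 2, \ldots, k;\ l = 1, 2, \ldots, k),$$

$$\sum_{j=1}^{m_l} \lambda_{llj} = 1 \quad (l = 1, 2, \ldots, k).$$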

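It is necessary now to find the $\lambda$'s which satisfy these conditions
and which make $\sigma_F^2$ a minimum, where

$$\sigma_F^2 = \mathcal{E}\Bigl\{ \Bigl( \sum_{l=1}^{k} \sum_{i=1}^{k} \sum_{j=1}^{m_i} \lambda_{lij}\, r_l\, x_{ij} - \sum_{i=1}^{k} p_i \bar{X}_i \Bigr)^2 \Bigr\}.$$

For convenience write

$$\theta_l = \sum_{i=1}^{k} \sum_{j=1}^{m_i} \lambda_{lij}\, x_{ij}$$

and calculate its expectation.

$$\mathcal{E}(\theta_l) = \sum_{i=1}^{k} \bar{X}_i \sum_{j=1}^{m_i} \lambda_{lij}.$$

We may rewrite $\sigma_F^2$.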



It will be convenient to calculate this in two stages and we shall 

therefore focus attention first on <f ( 2] ( r iiPiXi) 2 }- 

\i=i / 
since $r_i$ and $\theta_i$ are independent. We require therefore to evaluate

(k _ \ k k _ 

i=l / ls =i J l l i=i 

At this stage, in order to simplify the algebra, it will be assumed 
that the parent population was large enough for any individual 
drawn to be independent of any other. This assumption is not 
very restricting for the case of human populations which are 
usually so large that it is virtually true. 

We now introduce the device of the characteristic random
variable in order to calculate $\mathcal{E}(r_i^2)$ and $\mathcal{E}(r_i r_g)$ which will be
needed for the evaluation of the second part of $\sigma_F^2$. Two strata
only, the ith and the gth, need be discussed, for the same arguments
will hold good for other strata. Associate with the sample
of N individuals two series of characteristic random variables
$\alpha_\xi$ and $\beta_\nu$, for $\xi, \nu = 1, 2, \ldots, N$. These variables will have the
properties

$\alpha_\xi = 1$ if the individual falls in the ith stratum and zero
otherwise.

$\beta_\nu = 1$ if the individual falls in the gth stratum and zero
otherwise.

It is obvious, since

$$\sum_{\xi=1}^{N} \alpha_\xi = n_i \quad\text{and}\quad \sum_{\nu=1}^{N} \beta_\nu = n_g,$$

that

$$\frac{1}{N} \sum_{\xi=1}^{N} \alpha_\xi = r_i \quad\text{and}\quad \frac{1}{N} \sum_{\nu=1}^{N} \beta_\nu = r_g.$$

We have said that 
and it is clear that 



whence 
The expectation of the product $r_i r_g$ follows in a similar way.



We return to the evaluation of the first term of $\sigma_F^2$. It is simple
to show that



since the x's are assumed all independent and hence, on
substitution,

4 S (r^- 

\i-l 

k 

= s 



s 



and the first term is evaluated. 

We now turn to the second term of the expression for $\sigma_F^2$,
which may be rewritten



since $r_i$ and $\theta_i$, $r_l$ and $\theta_l$ are independent.

(k mi k mi k mq \ 

S S A^A tt; -4.+ S S S X A^A /(7W o: y ^ u , 
Z=l; = l 2l^la=-lul / 



which on substituting for <^(%) and ^(xfy) and using the 
conditions for unbiasedness reduces to 



k mi 

s^ S 



and 



The two terms comprising the expression for $\sigma_F^2$ are thus
evaluated and



(1\ / fc m 
l-^^4 S of S 
-*/ \ Z 1 ^ 1 



2 *-i fe f fc mi 

S S ftft (#-!) S of S 
Minimum value of $\sigma_F^2$

The conditions for the unbiasedness of the estimate F have
been found and the expression for $\sigma_F^2$. It remains to find what
values of the $\lambda$'s will satisfy the conditions for F and at the same
time make $\sigma_F^2$ a minimum. Consider a function



where the $\alpha$'s are Lagrange's undetermined multipliers. By
differentiating $\phi$ with respect to $\lambda_{lij}$ and equating the result to
zero we have

k 



Summing for all j and substituting the conditions for unbiasedness 

N-l , . 7 

for * * Z, 



- 22^-7 

y- ^(7? for ^ = I. 

It is easy to show by substituting these last values in the 
expression for <x# that 

X aj = A w when i=t=Z, 

A /z , = A K H --- when i = Z, 
" ; - 



where A w = (tf - 1) - 

It follows by splitting the right-hand sum into the three parts 

l-l k 

S \ijPt> A w p,, JS A^^, 
that $\lambda_{lij} = A_{li} = 0$ for $i \neq l$,

$\lambda_{llj} = \dfrac{1}{m_l}$ for $i = l$.

The best linear unbiased estimate will be therefore

$$F = \sum_{l=1}^{k} r_l \bar{x}_l, \quad\text{where } \bar{x}_l = \frac{1}{m_l} \sum_{j=1}^{m_l} x_{lj},$$

and $\sigma_F^2$ =
Choice of m i 

The term in $\sigma_F^2$ containing $m_i$, and therefore at choice for
altering $\sigma_F^2$, is the first sum. n is the total size of the sample $S_2$,
i.e.

$$n = \sum_{i=1}^{k} m_i.$$

The sum containing $m_i$ in the expression for $\sigma_F^2$ may be rearranged
to give




This may be checked by expanding the right-hand side of the 
equation. The minimum value of this expression will be reached 
when the value of the second term is zero, that is when 



or, since $p_i$ and $q_i$ are proportions less than unity and N is a large
number, when



when the expression for $\sigma_F^2$ becomes



An alternative method to the above is to use the method of
Lagrange's Undetermined Multipliers and minimize $\sigma_F^2$ subject
to the restriction that the sum of the $m_i$ is equal to n.
COROLLARY. The above analysis may be carried a stage further
by deciding, if the total sum of money to be spent is fixed, what
proportion of the sum should be spent on collecting the first
sample $S_1$. Let C be the total sum to be spent on the inquiry,
let A be the cost per individual of collecting information about
X and B the cost per individual of collecting information about
Y. Then

$$C = An + BN.$$



Let $L_A$ and $L_B$ be the smallest numbers consistent with the
relation

$$L_A \cdot B = L_B \cdot A,$$



and for convenience write 



2 a 2 
= -~ 



where a 2 = v A and 6 2 = 



a 2 = ( v ^p 

\i=l 



Since n and N are integers which minimize $\sigma_F^2$ it must follow that
a 2 ) a 2 ft 2 a 2 6 2 



from which it may be deduced that 
1 + L B jn a*.L B N* 



We therefore take $n \propto a\sqrt{L_B}$ and $N \propto b\sqrt{L_A}$ and decide that the
sum of money should be so divided that the samples $S_1$ and $S_2$
should be of numbers respectively

bC , aC 

N = JT-T-^ r^ and n = 
CHAPTER XV

CHARACTERISTIC FUNCTIONS. 
ELEMENTARY THEOREMS 

The characteristic function of a random variable is a useful 
device for the calculation of theoretical moments and cumulants 
of probability laws and, by means of the inversion theorem, of 
the probability laws themselves. It is possible that its application 
will not lead to the solution of problems which could not have 
been solved by other methods, but it is elegant mathematically 
and for some types of problem considerably shortens the 
necessary calculations. 

The theory of characteristic functions will be treated here in 
very elementary fashion, and it will not be possible to offer proofs 
for all the theorems. However, it is the application of these 
theorems in which the student will principally be interested. 
Such proofs as are omitted will be found in other texts. 

DEFINITION. $\phi_x(t)$ is defined as the characteristic function of
the random variable, x, or of the probability law of the random
variable, x, if $\phi_x(t) = \mathcal{E}(e^{itx})$.

This characteristic function will always exist since
$$|e^{itx}| = |(\cos^2 tx + \sin^2 tx)^{1/2}| = 1,$$

and it may be shown that there will be a 1:1 correspondence 
between a probability law and its characteristic function. The 
theorem, which is fundamental to the theory, we state without 
proof. 

THEOREM. To any probability law, p(x), there corresponds a
uniquely defined characteristic function, and conversely. By
definition

$$\phi_x(t) = \mathcal{E}(e^{itx}) = \sum p(x)\, e^{itx} \quad\text{if the variable is discontinuous}$$

$$= \int p(x)\, e^{itx}\, dx \quad\text{if the variable is continuous,}$$

the summation and the integral sign being taken over the whole
range of possible values of x.
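
The definition for a discontinuous variable can be evaluated directly. The sketch below is illustrative only: the helper name `char_fn` and the particular values of n, p and t are assumptions. It sums $p(x)e^{itx}$ for a binomial law and compares the result with the closed form $(q + pe^{it})^n$ quoted for the binomial in this book.

```python
import cmath
from math import comb


def char_fn(values, probs, t):
    """phi(t) = sum over x of p(x) * exp(i t x) for a discrete probability law."""
    return sum(p * cmath.exp(1j * t * x) for x, p in zip(values, probs))


# Binomial law with assumed parameters n = 5, p = 0.3.
n, p = 5, 0.3
q = 1 - p
values = list(range(n + 1))
probs = [comb(n, k) * p**k * q**(n - k) for k in values]

t = 0.7
phi_sum = char_fn(values, probs, t)             # direct summation
phi_closed = (q + p * cmath.exp(1j * t)) ** n   # closed form for the binomial
phi_zero = char_fn(values, probs, 0.0)          # must equal 1, the total probability
```

The modulus of `phi_sum` cannot exceed unity, which is the numerical counterpart of the existence argument given above.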
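Example. Find the characteristic function of the random
variable, x, whose probability law is the binomial, i.e.

$$P\{x = k\} = \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k} \quad (k = 0, 1, 2, \ldots, n).$$

From the definition

$$\phi_x(t) = \sum_{k=0}^{n} \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k} e^{itk} = (q + pe^{it})^n.$$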

The properties of the characteristic function may be stated in the 
form of a set of theorems, the proof of which follows directly 
from the definition. 
THEOREM. If a is a constant, then $\phi_{ax}(t) = \phi_x(at)$. For,

$$\phi_{ax}(t) = \mathcal{E}(e^{itax}) = \phi_x(at).$$

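THEOREM. If $x_1, x_2, \ldots, x_n$ are random independent variables,
then

$$\phi_{\sum x_j}(t) = \prod_{j=1}^{n} \phi_{x_j}(t).$$

For by definition

$$\phi_{\sum x_j}(t) = \mathcal{E}\Bigl( \exp\Bigl[ it \sum_{j=1}^{n} x_j \Bigr] \Bigr) = \mathcal{E}\Bigl( \prod_{j=1}^{n} e^{itx_j} \Bigr).$$

Since the variables are independent the expectation of the
product will equal the product of the expectations. Hence

$$\mathcal{E}\Bigl( \prod_{j=1}^{n} e^{itx_j} \Bigr) = \prod_{j=1}^{n} \mathcal{E}(e^{itx_j}) = \prod_{j=1}^{n} \phi_{x_j}(t)$$

and the theorem is proved.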
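
The product rule may be checked numerically for two small discrete laws. This is a sketch under assumed inputs: the two laws (a die and a coin) and the helper names `char_fn` and `convolve` are illustrative and not from the text. The law of the sum is built by direct convolution and its characteristic function is compared with the product of the two separate ones.

```python
import cmath


def char_fn(pmf, t):
    """Characteristic function of a discrete law given as {value: probability}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in pmf.items())


def convolve(pmf_a, pmf_b):
    """Probability law of the sum of two independent discrete variables."""
    out = {}
    for xa, pa in pmf_a.items():
        for xb, pb in pmf_b.items():
            out[xa + xb] = out.get(xa + xb, 0.0) + pa * pb
    return out


die = {k: 1 / 6 for k in range(1, 7)}
coin = {0: 0.5, 1: 0.5}
total = convolve(die, coin)

t = 1.3
lhs = char_fn(total, t)                     # characteristic function of the sum
rhs = char_fn(die, t) * char_fn(coin, t)    # product of the separate functions
```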

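THEOREM. If $a_1, a_2, \ldots, a_n$ are constants, and $x_1, x_2, \ldots, x_n$ are
random independent variables, then

$$\phi_{\sum a_j x_j}(t) = \prod_{j=1}^{n} \phi_{x_j}(a_j t).$$

For

$$\phi_{\sum a_j x_j}(t) = \prod_{j=1}^{n} \phi_{a_j x_j}(t) = \prod_{j=1}^{n} \phi_{x_j}(a_j t).$$

COROLLARY. If $a_1 = a_2 = \ldots = a_n = 1/n$, then

$$\phi_{\bar{x}}(t) = \prod_{j=1}^{n} \phi_{x_j}(t/n),$$

and further, if $x_1, x_2, \ldots, x_n$ all follow the same probability law,

$$\phi_{\bar{x}}(t) = (\phi_x(t/n))^n.$$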

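Example. Suppose that there are N random independent
variables $x_1, x_2, \ldots, x_N$ which follow the same binomial law of
probability

$$P\{x = k\} = \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k} \quad (k = 0, 1, 2, \ldots, n).$$

What is the characteristic function of their mean?
We have shown in a previous example that

$$\phi_x(t) = (q + pe^{it})^n$$

if x follows a binomial law as given above. If

$$\bar{x} = \frac{1}{N} \sum_{j=1}^{N} x_j,$$

then

$$\phi_{\bar{x}}(t) = (\phi_x(t/N))^N = (q + pe^{it/N})^{nN},$$

which is the characteristic function of a variable following a
generalized binomial probability law.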

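Example. Find the characteristic function of a variable whose
probability law is normal.

It is given that

$$P\{\alpha < x < \beta\} = \frac{1}{\sigma\sqrt{2\pi}} \int_{\alpha}^{\beta} \exp\Bigl[ -\frac{(x - \xi)^2}{2\sigma^2} \Bigr] dx$$

for $-\infty < \alpha, \beta < +\infty$.
From definition

$$\phi_x(t) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{+\infty} \exp\Bigl[ itx - \frac{(x - \xi)^2}{2\sigma^2} \Bigr] dx.$$

Hence

$$\phi_x(t) = \exp\bigl[ it\xi - \tfrac{1}{2} t^2 \sigma^2 \bigr].$$

COROLLARY. If $\xi = 0$ and $\sigma = 1$, then

$$\phi_x(t) = \exp\bigl[ -\tfrac{1}{2} t^2 \bigr].$$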
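We may use the more general result to prove the following
important theorem.

THEOREM. If $x_1, x_2, \ldots, x_n$ are independent random variables
each following a probability law

$$P\{\alpha < x_j < \beta\} = \frac{1}{\sigma_j\sqrt{2\pi}} \int_{\alpha}^{\beta} \exp\Bigl[ -\frac{(x_j - \xi_j)^2}{2\sigma_j^2} \Bigr] dx_j,$$

$j = 1, 2, \ldots, n$ and $-\infty < \alpha, \beta < +\infty$,
then, whatever the numbers $\lambda_1, \lambda_2, \ldots, \lambda_n$, the probability law of

$$y = \sum_{j=1}^{n} \lambda_j x_j$$

will be a normal distribution with mean $\sum_{j=1}^{n} \lambda_j \xi_j$ and variance $\sum_{j=1}^{n} \lambda_j^2 \sigma_j^2$.

$$\phi_y(t) = \prod_{j=1}^{n} \phi_{x_j}(\lambda_j t) = \exp\Bigl[ it \sum_{j=1}^{n} \lambda_j \xi_j - \frac{t^2}{2} \sum_{j=1}^{n} \lambda_j^2 \sigma_j^2 \Bigr].$$

The right-hand expression will be recognized as the characteristic
function of a normal variate having mean $\sum_{j=1}^{n} \lambda_j \xi_j$ and variance
$\sum_{j=1}^{n} \lambda_j^2 \sigma_j^2$, and the theorem is proved. It may be noted that, using
elementary theorems on expectations, we could have shown that

$$\mathcal{E}(y) = \sum_{j=1}^{n} \lambda_j \xi_j \quad\text{and}\quad \mathcal{E}(y - \mathcal{E}(y))^2 = \sum_{j=1}^{n} \lambda_j^2 \sigma_j^2.$$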

The characteristic function has enabled us here to take a step 
forward, for in addition to the mean and the variance we are 
able to specify the actual probability law. 

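Example. $x_1, x_2, \ldots, x_n$ are random independent variables each
following a probability law

$$P\{x_j = k\} = \frac{e^{-\lambda_j} \lambda_j^k}{k!}, \quad k = 0, 1, 2, \ldots, +\infty.$$

What is the probability law of their sum?

$$\phi_x(t) = \sum_{k=0}^{\infty} \frac{e^{-\lambda} \lambda^k}{k!} e^{itk} = \exp[\lambda(e^{it} - 1)].$$

This is true for any x when the appropriate subscripts are added.
Hence for $\sum_{j=1}^{n} x_j$ we have

$$\phi_{\sum x_j}(t) = \prod_{j=1}^{n} \phi_{x_j}(t) = \exp\Bigl[ (e^{it} - 1) \sum_{j=1}^{n} \lambda_j \Bigr].$$

It follows that the sum of a number of independent random variables, each of
which is distributed as Poisson, is also a Poisson variable with
probability law

$$P\Bigl\{ \sum_{j=1}^{n} x_j = k \Bigr\} = \frac{e^{-\Lambda} \Lambda^k}{k!}, \quad\text{where } \Lambda = \sum_{j=1}^{n} \lambda_j.$$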

These examples illustrate the derivation of probability laws by 
using their characteristic functions. The sum of a number of 
binomial variates has been shown to be distributed according to 
the binomial law, a number of Poisson variates as Poisson, and 
a number of normal variates as a normal distribution. 
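
The Poisson result can be confirmed without characteristic functions by convolving two Poisson laws directly. The sketch below is a check under assumed parameters; the helper names and the values 1.5 and 2.5 are illustrative. It compares the convolution with a single Poisson law whose mean is the sum of the two means.

```python
from math import exp, factorial


def poisson_pmf(lam, k):
    """P{x = k} for a Poisson law with mean lam."""
    return exp(-lam) * lam**k / factorial(k)


def sum_pmf(lam1, lam2, k):
    """P{x1 + x2 = k} by direct convolution of two independent Poisson laws."""
    return sum(poisson_pmf(lam1, j) * poisson_pmf(lam2, k - j) for j in range(k + 1))


lam1, lam2 = 1.5, 2.5
# Largest discrepancy over k = 0, ..., 19 between the convolution and Poisson(lam1 + lam2).
max_gap = max(
    abs(sum_pmf(lam1, lam2, k) - poisson_pmf(lam1 + lam2, k)) for k in range(20)
)
```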

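Consider one further distribution, that is the distribution of
the sum of a number of independent random variables

$$X_1, X_2, \ldots, X_n$$

when

$$X_j = x_j^2 \quad\text{and}\quad P\{a < x_j < b\} = \frac{1}{\sqrt{2\pi}} \int_{a}^{b} e^{-x_j^2/2}\, dx_j$$

for $j = 1, 2, \ldots, n$, $-\infty < a, b < +\infty$.

Let us take first of all one typical random variable, x. If this is
normally distributed then a simple substitution will show that
$x^2 = X$ has the distribution

$$P\{A < X < B\} = \frac{1}{2^{1/2}\, \Gamma(\tfrac{1}{2})} \int_{A}^{B} X^{-1/2} e^{-X/2}\, dX \quad\text{for } 0 < X < +\infty.$$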
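The characteristic function of any X will be

$$\phi_X(t) = \frac{1}{2^{1/2}\, \Gamma(\tfrac{1}{2})} \int_{0}^{\infty} X^{-1/2} \exp\bigl[ -\tfrac{1}{2} X (1 - 2it) \bigr] dX = (1 - 2it)^{-1/2}.$$

It is desired to find the distribution of

$$\sum_{j=1}^{n} X_j = \sum_{j=1}^{n} x_j^2.$$

The characteristic function of $\sum_{j=1}^{n} X_j$ is

$$(1 - 2it)^{-n/2}.$$

This will be recognized as being the characteristic function of
a variable

$$\chi^2 = \sum_{j=1}^{n} X_j = \sum_{j=1}^{n} x_j^2,$$

where the distribution of $\chi^2$ is

$$P\{\alpha < \chi^2 < \beta\} = \frac{1}{2^{n/2}\, \Gamma(n/2)} \int_{\alpha}^{\beta} (\chi^2)^{n/2 - 1} e^{-\chi^2/2}\, d(\chi^2) \quad\text{for } 0 < \chi^2 < +\infty.$$

We may proceed from here to show that the sum of any number
of independent $\chi^2$ is also distributed as $\chi^2$. For if

$$\phi_{\chi^2}(t) = (1 - 2it)^{-n/2}$$

is the characteristic function of $\chi^2$ distributed with n degrees of
freedom, then the characteristic function of

$$Y = \sum_{k=1}^{s} \chi_k^2,$$

where $\chi_k^2$ is distributed with $n_k$ degrees of freedom, will be

$$\phi_Y(t) = (1 - 2it)^{-\frac{1}{2} \sum_{k=1}^{s} n_k}.$$

From a comparison of $\phi_{\chi^2}(t)$ and $\phi_Y(t)$, the probability law of Y
will be

$$P\{A < Y < B\} = \frac{1}{2^{\frac{1}{2}\sum n_k}\, \Gamma(\tfrac{1}{2} \sum n_k)} \int_{A}^{B} Y^{\frac{1}{2}\sum n_k - 1} e^{-Y/2}\, dY$$

for $0 < Y < +\infty$,

which will be recognized as another $\chi^2$ distribution. The reader
should check this by writing down the characteristic function of
Y from the distribution and comparing with the characteristic
function already derived.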

These examples are sufficient to show how by straightforward 
application of the definition of the characteristic function the 
distributions can be obtained of various combinations of random 
variables following given probability laws. We may now proceed 
further and discuss the application of the characteristic function 
to certain limit theorems which have already been proved earlier. 
In order to do this it will be necessary to make use of a theorem 
which we shall state without proof. 

THEOREM. Let $p_1, p_2, \ldots, p_n, \ldots$ represent a sequence of
probability laws and $\phi_1(t), \phi_2(t), \ldots, \phi_n(t), \ldots$ be their corresponding
characteristic functions. If $\phi_n(t)$ tends to a limit, $\phi_0(t)$,
uniformly in any finite interval, then $p_n$ tends to a limit $p_0$ and
the characteristic function of $p_0$ will be $\phi_0(t)$.

We can use this theorem to prove the theorem that the normal 
curve is the limiting distribution of the binomial as the n of the 
binomial law increases without limit (see Chapter V). We shall
begin by defining a 'reduced' or 'standardized' variable.

DEFINITION. The random variable x with expectation equal
to m and standard error equal to $\sigma$ is said to have been 'reduced'
or 'standardized' when it is referred to a zero mean with unit
standard deviation; i.e. the reduced variable X is

$$X = \frac{x - m}{\sigma}.$$

The characteristic function of a reduced variable will be

$$\phi_X(t) = e^{-itm/\sigma}\, \phi_x(t/\sigma).$$
THEOREM. If k is a random variable distributed according to
the binomial law, then whatever a

$$P\{X < a\} \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-x^2/2}\, dx \quad\text{as } n \to \infty,$$

where

$$X = \frac{k - np}{\sqrt{npq}}.$$

X is then a standardized binomial variable and there will be
a sequence of probability laws $p_n(X), p_{n+1}(X), \ldots$ corresponding
to increasing n. If it can be shown that the characteristic
functions of these probability laws tend to a limit as $n \to \infty$, then
by the previous theorem it may be assumed that the probability
laws also tend to a limit and this limit will have as characteristic
function the limit of the sequence of characteristic functions.
From previous definitions

$$\phi_X(t) = \exp\Bigl[ -\frac{itnp}{\sqrt{npq}} \Bigr] \Bigl( q + p \exp\Bigl[ \frac{it}{\sqrt{npq}} \Bigr] \Bigr)^n = \Bigl( q \exp\Bigl[ -\frac{itp}{\sqrt{npq}} \Bigr] + p \exp\Bigl[ \frac{itq}{\sqrt{npq}} \Bigr] \Bigr)^n.$$

The interior of the right-hand bracket may be expanded into two
exponential series to give



/ f tt<l 1 f 

= h>exp . * +9' ex P 
\ U(nPVU L 



top 1\ 

~ 



x 



t 2 I it \ 3 

top 



itq \ 3 / a 
i-77- 2 ; <7 3 exp 0,. 

V(P3)/ \ V( 



where $0 < \theta_1, \theta_2 < 1$. Remembering $p + q = 1$ this may be written

. , f 2 P.R 

(T> _ 1 ___ I __ 
V A r ' 1> 
where



3!(M) 

As $n \to \infty$, $|R| < M$, where M is some fixed number arbitrarily
chosen. The characteristic function $\phi_X(t)$ is



/ /2 \ I * 

4-s)l 



We are now in a position to investigate the limit of $\phi_X(t)$ as
$n \to \infty$. Consider first



Take logarithms and expand the right-hand side as a series. 

r PR PR 

log z = n 



This is a convergent series each term of which tends separately 
to zero, as n -> oo. Hence log z tends to zero as n tends to infinity 
and therefore z tends to one as n tends to infinity. Now consider 
the term 



y tends to $\exp(-\tfrac{1}{2}t^2)$ as n increases without limit. It follows that

$$\lim_{n \to \infty} \phi_X(t) = e^{-t^2/2}.$$



This is recognized as the characteristic function of a variable 
whose probability law is a normal distribution with zero mean and 
unit standard deviation. Hence if

$$P\{k\} = \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k} \quad\text{for } k = 0, 1, 2, \ldots, n,$$

then whatever the value of a

$$P\Bigl\{ \frac{k - np}{\sqrt{npq}} < a \Bigr\} \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-x^2/2}\, dx$$

as $n \to \infty$.
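
The convergence of the characteristic function can be observed numerically. The following sketch evaluates the characteristic function of the standardized binomial variable, $(q e^{-itp/\sqrt{npq}} + p e^{itq/\sqrt{npq}})^n$, for increasing n and measures its distance from $e^{-t^2/2}$. The function name and the values p = 0.3, t = 1 are assumptions chosen for illustration.

```python
import cmath
from math import exp, sqrt


def std_binomial_cf(n, p, t):
    """Characteristic function of (k - np) / sqrt(npq) for a binomial k."""
    q = 1 - p
    s = sqrt(n * p * q)
    return (q * cmath.exp(-1j * t * p / s) + p * cmath.exp(1j * t * q / s)) ** n


p, t = 0.3, 1.0
target = exp(-t * t / 2)   # characteristic function of the unit normal law at t
errors = [abs(std_binomial_cf(n, p, t) - target) for n in (10, 100, 1000, 10000)]
```

The entries of `errors` shrink roughly like $1/\sqrt{n}$, in keeping with the remainder term of order $n^{-3/2}$ inside the bracket.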

We may now pass by similar analysis to two theorems each of 
which may be considered as a special case of an important 
theorem of Liapounoff. It is convenient to discuss these two
further special cases in detail before passing to a simplified form 
of the generalized theorem which we shall prove in the next 
chapter. In each of these special cases we shall assume a theorem 
used implicitly in the last theorem, and it may be well therefore 
to state it explicitly. 

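THEOREM. If $R_n$ is bounded, that is, if there exists a number
which exceeds $|R_n|$ whatever n, then

$$\Bigl( 1 + \frac{R_n}{n^{1+\delta}} \Bigr)^n \to 1 \quad\text{as } n \to \infty,$$

if $\delta > 0$. Write

$$u = \Bigl( 1 + \frac{R_n}{n^{1+\delta}} \Bigr)^n.$$

Then

$$\log u = n \Bigl[ \frac{R_n}{n^{1+\delta}} - \frac{1}{2} \Bigl( \frac{R_n}{n^{1+\delta}} \Bigr)^2 + \ldots \Bigr].$$

Each term of this series tends separately to zero as $n \to \infty$,
provided $\delta > 0$, since $R_n$ is bounded. It follows therefore that
$\log u \to 0$ as $n \to \infty$ and that $u \to 1$.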

Laplace's theorem concerning the limiting distribution of a 
binomial variable was proved in an earlier chapter by straight- 
forward analysis. The next two theorems could also be proved 
without reference to the characteristic function, but there is no 
doubt, as in Laplace's theorem, that considerably heavy algebra 
would be involved. Using the characteristic function limit 
theorem the proofs are comparatively simple. Let us consider an 
extreme case and show that under certain conditions the distri- 
bution of the sum of n standardized variables, each following 
Poisson's limit to the binomial law, tends to normality as n tends 
to infinity. 

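THEOREM. $x_1, x_2, \ldots, x_n$ are random independent variables each
having a probability law

$$p(x_k) = \frac{e^{-\lambda_k} \lambda_k^{x_k}}{x_k!} \quad\text{for } k = 1, 2, \ldots, n \text{ and } x_k = 0, 1, 2, \ldots, +\infty.$$

If $x = x_1 + x_2 + \ldots + x_n$, then, under certain conditions,

$$P\Bigl\{ \frac{x - \sum_{k=1}^{n} \lambda_k}{\bigl( \sum_{k=1}^{n} \lambda_k \bigr)^{1/2}} < a \Bigr\} \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-y^2/2}\, dy \quad\text{as } n \to \infty.$$

We have previously shown that if $x_k$ follows the Poisson law given above,
then

$$\phi_{x_k}(t) = \exp[\lambda_k (e^{it} - 1)].$$

If $x = \sum_{k=1}^{n} x_k$,
then

$$\phi_x(t) = \exp\Bigl[ (e^{it} - 1) \sum_{k=1}^{n} \lambda_k \Bigr],$$

from which it may be deduced that the mean and variance of
x are each equal to $\sum_{k=1}^{n} \lambda_k$. It follows that

$$Y = \frac{x - \sum_{k=1}^{n} \lambda_k}{\bigl( \sum_{k=1}^{n} \lambda_k \bigr)^{1/2}}$$

is a standardized Poisson variable. Write for convenience

$$\sigma_x^2 = \sum_{k=1}^{n} \lambda_k.$$

From first principles

$$\phi_Y(t) = \exp\bigl[ -it\sigma_x + \sigma_x^2 (e^{it/\sigma_x} - 1) \bigr].$$

It is necessary here to distinguish between two cases.

Case I. As $n \to \infty$ it may happen that $\sigma_x^2$ tends to a finite
limit, i.e.

$$\sum_{k=1}^{n} \lambda_k \to L < +\infty \quad\text{as } n \to \infty.$$

If this is so, then

$$\lim_{n \to \infty} \phi_x(t) = \exp[L(e^{it} - 1)],$$

which is recognized as being the characteristic function of a
Poisson distribution with mean L. Hence the theorem cannot be
true if $\sum_{k=1}^{n} \lambda_k$ has a finite limit as $n \to \infty$.

Case II. Assume that $\sum_{k=1}^{n} \lambda_k = \sigma_x^2 \to +\infty$ as $n \to \infty$ and consider
the logarithm of the characteristic function of Y.

$$\log \phi_Y(t) = -\frac{t^2}{2} + \frac{(it)^3}{3!\, \sigma_x}\, e^{i\theta t/\sigma_x}, \quad\text{where } 0 < \theta < 1.$$

Hence, as $n \to \infty$,

$$\lim_{n \to \infty} \phi_Y(t) = e^{-t^2/2}$$

and the theorem follows. Accordingly the standardized sum
of n independent Poisson variables will tend to be normally
distributed as n increases without limit provided that the sum of
the means of the n variables tends to infinity with n.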
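
The distinction between the two cases can be seen numerically. A standardized Poisson variable with mean (and variance) $\sigma^2$ has characteristic function $\exp[-it\sigma + \sigma^2(e^{it/\sigma} - 1)]$, which follows from the Poisson characteristic function and the rule for a reduced variable. The sketch below (function name and parameter values are assumptions) shows the gap from $e^{-t^2/2}$ shrinking only as the total mean grows.

```python
import cmath
from math import exp, sqrt


def std_poisson_cf(total_mean, t):
    """Characteristic function of (x - m) / sqrt(m) for a Poisson x with mean m."""
    s = sqrt(total_mean)
    return cmath.exp(-1j * t * s + total_mean * (cmath.exp(1j * t / s) - 1))


t = 1.0
target = exp(-t * t / 2)
# Gap from the normal characteristic function for total means 4, 100 and 10000.
gaps = [abs(std_poisson_cf(m, t) - target) for m in (4, 100, 10000)]
```

If the sum of the means stays at a finite value such as 4, the gap stays at its first value however many variables are added, which is Case I.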

THEOREM. The standardized mean of n random variables, 
each independently and rectangularly distributed, tends to be 
normally distributed as TI->OO. 

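$x_1, x_2, \ldots, x_n$ are random independent variables which are
rectangularly distributed. Assume therefore that

$$p(x_k) = \frac{1}{2a} \quad\text{for } -a \le x_k \le +a \text{ and zero elsewhere,}$$

for $k = 1, 2, \ldots, n$.
We begin by finding $\mathcal{E}(\bar{x})$ and $\sigma_{\bar{x}}^2$.

$$\mathcal{E}(\bar{x}) = 0, \qquad \sigma_{\bar{x}}^2 = \frac{1}{n^2} \sum_{k=1}^{n} \int_{-a}^{+a} \frac{x_k^2}{2a}\, dx_k = \frac{a^2}{3n}.$$

If therefore Y is the standardized mean of the x's, then

$$Y = \frac{\bar{x} - \mathcal{E}(\bar{x})}{\sigma_{\bar{x}}} = \frac{\sqrt{3}}{a\sqrt{n}} \sum_{j=1}^{n} x_j.$$

For any x,

$$\phi_x(t) = \int_{-a}^{+a} e^{itx}\, \frac{dx}{2a} = \frac{e^{iat} - e^{-iat}}{2ait} = \frac{\sin at}{at}.$$

Hence

$$\phi_Y(t) = \Bigl[ \frac{\sin\bigl( t\sqrt{3/n} \bigr)}{t\sqrt{3/n}} \Bigr]^n.$$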
The numerator may be expanded in a sine series and we have,
for $0 < \theta < 1$,



By a previous theorem the right-hand bracket may be shown to
tend to unity as n tends to infinity and

$$\phi_Y(t) \to e^{-t^2/2} \quad\text{as } n \to \infty.$$

We have therefore that the standardized mean of n variables, 
each of which is independently and rectangularly distributed, 
tends to be normally distributed as n is increased without limit. 
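
A numerical check of this limit is straightforward. For a variable rectangularly distributed between $-a$ and $+a$ the characteristic function is $\sin at / at$ and the variance is $a^2/3$, so the standardized mean of n such variables has characteristic function $[\sin u / u]^n$ with $u = t\sqrt{3/n}$, whatever the value of a. The sketch below rests on that derivation; the function name and the chosen t are assumptions.

```python
from math import exp, sin, sqrt


def std_mean_rect_cf(n, t):
    """Characteristic function of the standardized mean of n rectangular variables."""
    u = t * sqrt(3.0 / n)
    return (sin(u) / u) ** n


t = 1.5
target = exp(-t * t / 2)
# Distance from the unit normal characteristic function for growing n.
gaps = [abs(std_mean_rect_cf(n, t) - target) for n in (2, 20, 200, 2000)]
```

The gap falls off roughly like $1/n$, faster than in the binomial case, because the rectangular law is symmetrical and has no third cumulant.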
In M. G. Kendall, The Advanced Theory of Statistics, the student will
find the characteristic function used in a variety of ways. There is no 
elementary text which can be recommended as it is very difficult to 
develop characteristic function theory without making considerable use 
of the theory of functions. 



CHAPTER XVI 

CHARACTERISTIC FUNCTIONS: MOMENTS 
AND CUMULANTS. LIAPOUNOFF'S THEOREM 

Before proceeding to discuss a generalization of the two theorems 
proved at the end of the last chapter, whereby it may be shown 
that under certain conditions the mean of the sum of n random 
variables of whatever distribution will tend to be normally 
distributed as n tends to infinity, it will be necessary to discuss 
certain properties of the characteristic function as a moment 
generating function as these properties will be needed for the 
proof of the theorem. We shall consider first of all the relation 
between the characteristic function and the moments and 
cumulants of a probability distribution. 

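$\phi_x(t)$, the characteristic function of the random variable, x, is
defined as

$$\phi_x(t) = \int_{-\infty}^{+\infty} e^{itx} p(x)\, dx$$

or

$$\phi_x(t) = \sum e^{itx} p(x)$$

according as the variable is continuous or discontinuous. We now
assume that in the neighbourhood of $t = 0$, $\phi_x(t)$ is differentiable
with respect to t as often as desired, and we write for both
continuous and discontinuous variables,

$$\phi_x(t) = \phi_x(0) + t\, \phi'_x(0) + \frac{t^2}{2!}\, \phi''_x(0) + \ldots + \frac{t^r}{r!}\, \phi^{(r)}_x(0) + \ldots.$$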
We shall confine ourselves to discussing the case where the 
variable is continuous in that the situation is a little more com- 
plicated than for the discontinuous case, but the discussion for 
the discontinuous variable can be exactly paralleled by the 
reader substituting a summation for an integral sign. We shall 
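consider the terms on the right-hand side of the expansion in
turn.

$\phi_x(0)$. $\phi_x(0) = \int_{-\infty}^{+\infty} p(x)\, dx = 1$ by definition.

$\phi'_x(t)$.

$$\phi'_x(t) = \frac{d\phi_x(t)}{dt} = \lim_{\delta t \to 0} \frac{1}{\delta t} \int_{-\infty}^{+\infty} p(x) \bigl[ e^{i(t + \delta t)x} - e^{itx} \bigr] dx.$$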
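Expand $e^{i(t + \delta t)x}$ in the following way:

$$e^{i(t + \delta t)x} = e^{itx} + \delta t\, (ix)\, e^{itx} + \frac{(\delta t)^2}{2!} (ix)^2 e^{i(t + \theta \delta t)x}, \quad\text{where } 0 < \theta < 1,$$

and substitute into $\phi'_x(t)$.

$$\phi'_x(t) = \lim_{\delta t \to 0} \int_{-\infty}^{+\infty} p(x) \Bigl[ ix\, e^{itx} + \frac{\delta t}{2!} (ix)^2 e^{i(t + \theta \delta t)x} \Bigr] dx$$

$$= i \int_{-\infty}^{+\infty} e^{itx} x\, p(x)\, dx + \lim_{\delta t \to 0} \frac{\delta t}{2}\, i^2 \int_{-\infty}^{+\infty} e^{i(t + \theta \delta t)x} x^2\, p(x)\, dx.$$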

The first integral on the right-hand side will be finite provided

$$\int_{-\infty}^{+\infty} |x|\, p(x)\, dx$$

is finite, and the second provided the first two moments of x are
finite. Let $\delta t$ tend to zero and we have

$$\phi'_x(t) = i \int_{-\infty}^{+\infty} e^{itx} x\, p(x)\, dx$$

and at the point $t = 0$

$$\phi'_x(0) = i \int_{-\infty}^{+\infty} x\, p(x)\, dx = i m'_1.*$$

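In a similar way it may be shown that

$$\phi''_x(t) = i^2 \int_{-\infty}^{+\infty} x^2 e^{itx} p(x)\, dx$$

and therefore that

$$\phi''_x(0) = i^2 \int_{-\infty}^{+\infty} x^2 p(x)\, dx = i^2 m'_2.$$

A similar reasoning will give values for the higher derivatives.
We have then

$$\phi_x(t) = 1 + (it)\, m'_1 + \frac{(it)^2}{2!}\, m'_2 + \ldots + \frac{(it)^r}{r!}\, m'_r + \ldots,$$

that is, the numerical coefficient of

$$\frac{(it)^r}{r!}$$

in the expansion of $\phi_x(t)$ in powers of t is the rth moment of the
random variable x about an arbitrary origin.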
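
The relation between derivatives at $t = 0$ and moments can be illustrated numerically. The sketch below replaces exact differentiation by central finite differences, which is an approximation introduced here and not a method of the text; the function names, the step h and the binomial parameters are likewise assumptions. It recovers $m'_1$ from $\phi'_x(0)/i$ and $m'_2$ from $\phi''_x(0)/i^2$.

```python
import cmath


def binomial_cf(t, n=10, p=0.3):
    """Characteristic function (q + p e^{it})^n of a binomial variable."""
    return (1 - p + p * cmath.exp(1j * t)) ** n


def raw_moments_from_cf(cf, h=1e-3):
    """Approximate m'_1 and m'_2 from finite-difference derivatives at t = 0."""
    d1 = (cf(h) - cf(-h)) / (2 * h)                 # approximates phi'(0)
    d2 = (cf(h) - 2 * cf(0.0) + cf(-h)) / (h * h)   # approximates phi''(0)
    m1 = (d1 / 1j).real                             # phi'(0) = i m'_1
    m2 = (d2 / (1j ** 2)).real                      # phi''(0) = i^2 m'_2
    return m1, m2


m1, m2 = raw_moments_from_cf(binomial_cf)
variance = m2 - m1 ** 2   # second moment about the mean
```

For n = 10 and p = 0.3 the exact values are np = 3 and npq = 2.1, which the finite differences reproduce to within the truncation error of the step.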

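* m is used instead of $\mu$ for ease of writing.

A simple extension of this theory gives the relationship between
moments about an arbitrary origin and moments about the mean;
for,

$$\phi_x(t) = \int_{-\infty}^{+\infty} e^{itx} p(x)\, dx = e^{itm'_1} \int_{-\infty}^{+\infty} e^{it(x - m'_1)} p(x)\, dx$$

as proved previously. Expand each side in powers of t

$$1 + (it)\, m'_1 + \frac{(it)^2}{2!}\, m'_2 + \ldots = \Bigl( 1 + (it)\, m'_1 + \frac{(it)^2}{2!}\, m'^2_1 + \ldots \Bigr) \Bigl( 1 + \frac{(it)^2}{2!}\, m_2 + \frac{(it)^3}{3!}\, m_3 + \ldots \Bigr)$$

and equate the coefficients of equal powers of t. We have that

$$m'_2 = m_2 + m'^2_1, \qquad m'_3 = m_3 + 3 m_2 m'_1 + m'^3_1,$$

$$m'_4 = m_4 + 4 m_3 m'_1 + 6 m_2 m'^2_1 + m'^4_1,$$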



and so on, relations which are very familiar to the statistics 
student. 

Example. Find the moments of the binomial variable, k,
whose elementary probability law is

$$P\{k\} = \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k} \quad\text{for } k = 0, 1, 2, \ldots, n.$$

We have already found the moments in two different ways. This 
third method is an improvement in labour on the first and is as 
quick to carry out as the second; thus 



and so on. 

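Example. Find the moments of the normal variate x whose
integral probability law is

$$P\{\alpha < x < \beta\} = \frac{1}{\sigma\sqrt{2\pi}} \int_{\alpha}^{\beta} \exp\Bigl[ -\frac{x^2}{2\sigma^2} \Bigr] dx \quad\text{for } -\infty < x < +\infty.$$

The characteristic function is

$$\phi_x(t) = \exp\Bigl[ -\frac{t^2 \sigma^2}{2} \Bigr],$$

whence

$$\phi'_x(0) = 0, \quad \phi''_x(0) = i^2 \sigma^2, \quad \phi'''_x(0) = 0, \quad \phi''''_x(0) = i^4\, 3\sigma^4, \text{ etc.}$$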

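Example. Find the moments of the variable, $\chi^2$, whose integral
probability law is

$$P\{\alpha < \chi^2 < \beta\} = \frac{1}{2^{n/2}\, \Gamma(n/2)} \int_{\alpha}^{\beta} (\chi^2)^{n/2 - 1} e^{-\chi^2/2}\, d(\chi^2) \quad\text{for } 0 < \chi^2 < +\infty.$$

In a previous chapter it was found for $\chi^2$ distributed in this way
that the characteristic function is

$$\phi_{\chi^2}(t) = (1 - 2it)^{-n/2}.$$

Differentiating with respect to t and putting $t = 0$ we have

$$\phi'_{\chi^2}(0) = in, \qquad \phi''_{\chi^2}(0) = i^2 n(n + 2),$$

and generally

$$\phi^{(r)}_{\chi^2}(0) = i^r\, n(n + 2)(n + 4) \ldots (n + 2r - 2).$$

It follows that

$$m'_1 = n, \quad m_2 = 2n, \quad m_3 = 8n, \quad m_4 = 12n(n + 4).$$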

We may now carry the theory a stage further. Assume that there
are n random independent variables $x_1, x_2, \ldots, x_n$. It follows from
definition that

$$\phi_{\sum x_j}(t) = \prod_{j=1}^{n} \phi_{x_j}(t).$$

Hence if we define another function, sometimes spoken of as the
cumulative function, as

$$\psi_x(t) = \log \phi_x(t),$$

it will follow that

$$\psi_{\sum x_j}(t) = \psi_{x_1}(t) + \psi_{x_2}(t) + \ldots + \psi_{x_n}(t) = \sum_{j=1}^{n} \psi_{x_j}(t).$$

Thus, whatever the distributions of the random variables,
provided that they are independent, their cumulative functions
will be additive. We may use the definition of the cumulative
function to define the cumulants of a distribution. These cumulants
are the same as the semi-invariants of Thiele. Consider
$\psi_x(t)$ and assume, as for $\phi_x(t)$, that it is differentiable as often as
desired in the neighbourhood of $t = 0$, i.e. assume that we may
write

$$\psi_x(t) = \psi_x(0) + t\, \psi'_x(0) + \frac{t^2}{2!}\, \psi''_x(0) + \ldots.$$
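Following closely the previous work it may be shown, by
differentiating the logarithm of the characteristic function the
required number of times, that



Writing formally

$$\psi_x(t) = \kappa_1 (it) + \kappa_2 \frac{(it)^2}{2!} + \kappa_3 \frac{(it)^3}{3!} + \ldots,$$

where $\kappa_1, \kappa_2, \ldots$ are called the cumulants of the variable x, it will
be seen that

$$\kappa_1 = m'_1, \quad \kappa_3 = m_3, \quad \kappa_5 = m_5 - 10 m_3 m_2,$$

$$\kappa_2 = m_2, \quad \kappa_4 = m_4 - 3m_2^2, \quad \kappa_6 = m_6 - 15 m_4 m_2 - 10 m_3^2 + 30 m_2^3.$$

These may be written the other way round to give

$$m_2 = \kappa_2, \quad m_3 = \kappa_3, \quad m_4 = \kappa_4 + 3\kappa_2^2,$$

$$m_5 = \kappa_5 + 10 \kappa_3 \kappa_2, \quad m_6 = \kappa_6 + 15 \kappa_4 \kappa_2 + 10 \kappa_3^2 + 15 \kappa_2^3.$$

It is also sometimes useful to be able to express moments about
an arbitrary origin in terms of these cumulants, viz.

$$m'_1 = \kappa_1, \quad m'_2 = \kappa_2 + \kappa_1^2, \quad m'_3 = \kappa_3 + 3\kappa_2 \kappa_1 + \kappa_1^3,$$

$$m'_4 = \kappa_4 + 4\kappa_3 \kappa_1 + 3\kappa_2^2 + 6\kappa_2 \kappa_1^2 + \kappa_1^4,$$

$$m'_5 = \kappa_5 + 5\kappa_4 \kappa_1 + 10\kappa_3 \kappa_2 + 10\kappa_3 \kappa_1^2 + 15\kappa_2^2 \kappa_1 + 10\kappa_2 \kappa_1^3 + \kappa_1^5,$$

$$m'_6 = \kappa_6 + 6\kappa_5 \kappa_1 + 15\kappa_4 \kappa_2 + 15\kappa_4 \kappa_1^2 + 10\kappa_3^2 + 60\kappa_3 \kappa_2 \kappa_1 + 20\kappa_3 \kappa_1^3 + 15\kappa_2^3 + 45\kappa_2^2 \kappa_1^2 + 15\kappa_2 \kappa_1^4 + \kappa_1^6.$$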

These may be checked by substituting moments about the mean 
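for moments about an arbitrary origin. Now if

$$\psi_{x_1}(t) = \kappa_{11} (it) + \kappa_{21} \frac{(it)^2}{2!} + \ldots \quad\text{and}\quad \psi_{x_2}(t) = \kappa_{12} (it) + \kappa_{22} \frac{(it)^2}{2!} + \ldots,$$

then the cumulants of the distribution of $x_1 + x_2$ will be given
by the addition of the cumulants of the separate distributions,
always provided $x_1$ and $x_2$ are independent.

$$\psi_{x_1 + x_2}(t) = it(\kappa_{11} + \kappa_{12}) + \frac{(it)^2}{2!} (\kappa_{21} + \kappa_{22}) + \frac{(it)^3}{3!} (\kappa_{31} + \kappa_{32}) + \ldots$$

$$= it\, \kappa_1 + \frac{(it)^2}{2!}\, \kappa_2 + \frac{(it)^3}{3!}\, \kappa_3 + \ldots.$$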

It is this additive property of the cumulants which makes them 
of great use. 
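
The additive property can be verified on two small discrete laws. The sketch below computes the first four cumulants from the relations $\kappa_1 = m'_1$, $\kappa_2 = m_2$, $\kappa_3 = m_3$, $\kappa_4 = m_4 - 3m_2^2$ and compares the cumulants of the sum with the sums of the cumulants. The two laws and the helper names are assumptions chosen only for illustration.

```python
def cumulants(pmf):
    """First four cumulants of a discrete law given as {value: probability}."""
    mean = sum(p * x for x, p in pmf.items())

    def central(r):
        # rth moment about the mean
        return sum(p * (x - mean) ** r for x, p in pmf.items())

    m2, m3, m4 = central(2), central(3), central(4)
    return [mean, m2, m3, m4 - 3 * m2**2]


def convolve(pmf_a, pmf_b):
    """Probability law of the sum of two independent discrete variables."""
    out = {}
    for xa, pa in pmf_a.items():
        for xb, pb in pmf_b.items():
            out[xa + xb] = out.get(xa + xb, 0.0) + pa * pb
    return out


a = {0: 0.2, 1: 0.5, 3: 0.3}
b = {-1: 0.6, 2: 0.4}
k_sum = cumulants(convolve(a, b))                                  # cumulants of x1 + x2
k_added = [ka + kb for ka, kb in zip(cumulants(a), cumulants(b))]  # added cumulants
```

The same check applied to the fourth moment about the mean, rather than the fourth cumulant, would fail, which is the practical reason for preferring cumulants here.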

Example. Find Sheppard's corrections for moments calculated 
from grouped data. 

One of the first things studied by the reader in statistics is 
the grouping of observations and the correction of moments 
calculated from these grouped observations for the effect of 
grouping. If $X_E$ be the true value of a variate, $X_G$ the central
value of the group extending from $X_G - \tfrac{1}{2}h$ to $X_G + \tfrac{1}{2}h$, and x
the error introduced by grouping, then assuming independence

$$X_G - X_E = x \quad\text{or}\quad X_G = X_E + x$$

and

$$\psi_{X_G}(t) = \psi_{X_E}(t) + \psi_x(t).$$

Hence, if $\kappa_1(G), \kappa_2(G), \ldots, \kappa_r(G), \ldots$ are the cumulants of $X_G$,
$\kappa_1(E), \kappa_2(E), \ldots, \kappa_r(E), \ldots$ are the cumulants of $X_E$ and $\kappa_1(x)$,
$\kappa_2(x), \ldots, \kappa_r(x), \ldots$ are the cumulants of x, we shall have

$$\kappa_r(G) = \kappa_r(E) + \kappa_r(x) \quad\text{and}\quad \kappa_r(E) = \kappa_r(G) - \kappa_r(x).$$

The cumulants of the grouped observations will be calculated
from the data. It remains to consider the cumulants of the
grouping error, x, and in order to do this it will be necessary to
make assumptions about its distribution. There are many which
may be made but we shall only take the simplest, i.e. that x is
equally likely to take any value between $-\tfrac{1}{2}h$ and $+\tfrac{1}{2}h$. This is
equivalent to saying that it is assumed that the integral
probability law of x is

$$P\{\alpha < x < \beta\} = \int_{\alpha}^{\beta} \frac{dx}{h}.$$
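The characteristic function will be

$$\phi_x(t) = \frac{1}{h} \int_{-h/2}^{+h/2} e^{itx}\, dx = \frac{\sin \tfrac{1}{2}ht}{\tfrac{1}{2}ht}.$$

Taking the logarithm of $\phi_x(t)$ and expanding, as in previous
examples, we have

$$\psi_x(t) = \frac{h^2}{12} \frac{(it)^2}{2!} - \frac{h^4}{120} \frac{(it)^4}{4!} + \frac{h^6}{252} \frac{(it)^6}{6!} - \ldots.$$

Hence

$$\kappa_2(x) = m_2 = \frac{h^2}{12}, \qquad \kappa_4(x) = m_4 - 3m_2^2 = -\frac{h^4}{120},$$

$$\kappa_6(x) = m_6 - 15 m_4 m_2 - 10 m_3^2 + 30 m_2^3 = \frac{h^6}{252}.$$

If we write $\mu_2(G), \mu_4(G), \ldots$ for the moments of the grouped
observations about the mean then from the relationship

$$\kappa_r(E) = \kappa_r(G) - \kappa_r(x)$$

it is found on substitution that

$$\kappa_2(E) = \mu_2(G) - \frac{h^2}{12}, \qquad \kappa_4(E) = \mu_4(G) - 3\mu_2^2(G) + \frac{h^4}{120},$$

and the corrected moments of the distribution will be

$$\mu_2 = \mu_2(G) - \frac{h^2}{12}, \qquad \mu_4 = \mu_4(G) - \frac{h^2}{2}\, \mu_2(G) + \frac{7h^4}{240}.$$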

The corrections for the higher even moments may be calculated 
similarly. There will be no corrections for the odd moments as 
might be expected from the assumption regarding the distribu- 
tion of the grouping error. 
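
The corrections for the second and fourth moments can be obtained either from the usual closed formulas or by passing through cumulants as in the argument above. The sketch below does both and confirms that they agree; it assumes the rectangular grouping error with $\kappa_2(x) = h^2/12$ and $\kappa_4(x) = -h^4/120$, and the function names and numerical inputs are hypothetical.

```python
def sheppard_corrected(mu2_grouped, mu4_grouped, h):
    """Closed-form Sheppard corrections for the second and fourth moments."""
    mu2 = mu2_grouped - h**2 / 12
    mu4 = mu4_grouped - 0.5 * h**2 * mu2_grouped + 7 * h**4 / 240
    return mu2, mu4


def via_cumulants(mu2_grouped, mu4_grouped, h):
    """Same corrections obtained by subtracting the grouping-error cumulants."""
    k2 = mu2_grouped - h**2 / 12                           # kappa_2(E)
    k4 = (mu4_grouped - 3 * mu2_grouped**2) + h**4 / 120   # kappa_4(E)
    return k2, k4 + 3 * k2**2                              # back to moments about the mean


# Assumed grouped moments and grouping interval, for illustration only.
direct = sheppard_corrected(4.0, 50.0, 1.0)
check = via_cumulants(4.0, 50.0, 1.0)
```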

Example. Find the cumulants of the binomial variate, k,
whose elementary probability law is

$$P\{k\} = \frac{n!}{k!\,(n-k)!}\, p^k q^{n-k} \quad\text{for } k = 0, 1, 2, \ldots, n.$$

It is known that

$$\phi_k(t) = (q + pe^{it})^n$$

or, for k referred to np as origin,

$$\phi_{k - np}(t) = (q e^{-itp} + p e^{itq})^n.$$

Hence

$$\psi_{k - np}(t) = n \log (q e^{-itp} + p e^{itq})$$

and the cumulants of the binomial are obtained by successive
differentiation.

$$\kappa_1 = 0, \quad \kappa_3 = -npq(p - q), \quad \kappa_5 = -npq(p - q)(1 - 12pq),$$

$$\kappa_2 = npq, \quad \kappa_4 = npq - 6np^2q^2.$$



Exercise. Find a recurrence formula for the binomial cumulants. 

Exercise. Find the moments and cumulants of a random 
variable whose probability law is Neyman's contagious dis- 
tribution. 

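Example. Find the cumulants of the continuous variable $\chi^2$
whose integral probability law is

$$P\{\alpha < \chi^2 < \beta\} = \frac{1}{2^{n/2}\, \Gamma(n/2)} \int_{\alpha}^{\beta} (\chi^2)^{n/2 - 1} e^{-\chi^2/2}\, d(\chi^2) \quad\text{for } 0 < \chi^2 < +\infty.$$

The characteristic function of the variable $\chi^2$ has been shown
to be

$$\phi_{\chi^2}(t) = (1 - 2it)^{-n/2}.$$

Hence

$$\psi_{\chi^2}(t) = \log \phi_{\chi^2}(t) = -\tfrac{1}{2} n \log (1 - 2it).$$

By successive differentiation (and putting t equal to zero), or by
expanding in powers of t, it may be shown that

$$\kappa_1 = n, \quad \kappa_2 = 2n, \quad \kappa_3 = 8n, \quad \kappa_4 = 48n, \quad \kappa_5 = 384n,$$

and generally that $\kappa_r = 2^{r-1} (r - 1)!\, n$.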
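
The general formula $\kappa_r = 2^{r-1}(r-1)!\,n$ is easily tabulated, and the moments about the mean follow from the relations between moments and cumulants ($m_2 = \kappa_2$, $m_3 = \kappa_3$, $m_4 = \kappa_4 + 3\kappa_2^2$). A minimal sketch, with function names and the choice n = 5 assumed for illustration:

```python
from math import factorial


def chi2_cumulant(r, n):
    """rth cumulant of chi-square with n degrees of freedom: 2^(r-1) (r-1)! n."""
    return 2 ** (r - 1) * factorial(r - 1) * n


def chi2_central_moments(n):
    """Second, third and fourth moments about the mean, built from the cumulants."""
    k2, k3, k4 = (chi2_cumulant(r, n) for r in (2, 3, 4))
    return k2, k3, k4 + 3 * k2**2


n = 5
first_five = [chi2_cumulant(r, n) for r in range(1, 6)]   # n, 2n, 8n, 48n, 384n
m2, m3, m4 = chi2_central_moments(n)
```

The fourth moment obtained this way equals $48n + 12n^2 = 12n(n + 4)$.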

The moments of $\chi^2$ are easily obtained from the relation between
cumulants and moments. If there are s independent variables
each distributed as $\chi^2$ with integral probability law

$$P\{\alpha < \chi_j^2 < \beta\} = \frac{1}{2^{n_j/2}\, \Gamma(n_j/2)} \int_{\alpha}^{\beta} (\chi_j^2)^{n_j/2 - 1} e^{-\chi_j^2/2}\, d(\chi_j^2)$$

for $0 < \chi_j^2 < +\infty$ and $j = 1, 2, \ldots, s$,
then the cumulants of the distribution of their sum, i.e.

$$Y = \sum_{j=1}^{s} \chi_j^2,$$

will be given by

$$\kappa_r = \sum_{j=1}^{s} 2^{r-1} (r - 1)!\, n_j = 2^{r-1} (r - 1)! \sum_{j=1}^{s} n_j,$$

where r takes values 1, 2, 3, ... to give $\kappa_1, \kappa_2, \kappa_3, \ldots$.

The relation between the characteristic function and the
moments of a random variable will be useful to prove a simplified
form of Liapounoff's theorem, specialized cases of which have
been discussed in the preceding chapter.

LIAPOUNOFF'S THEOREM (SIMPLIFIED) 

(1) If $x_1, x_2, \ldots, x_n$ are mutually independent random variables.

(2) If $x_1, x_2, \ldots, x_n$ each possesses the first three absolute
moments, $\beta_{1k}, \beta_{2k}, \beta_{3k}$ ($k = 1, 2, \ldots, n$), where

$$\beta_{\alpha k} = \int_{-\infty}^{+\infty} |x_k - \mathcal{E}(x_k)|^{\alpha}\, p(x_k)\, dx_k$$

or $\beta_{\alpha k} = \sum |x_k - \mathcal{E}(x_k)|^{\alpha}\, p(x_k)$ for $\alpha = 1, 2, 3$,

according as the variables are continuous or discontinuous.

(3) If the second and third moments are each bounded, i.e.
if there exist two pairs of numbers

$$m_2 \le M_2 \quad\text{and}\quad m_3 \le M_3$$

such that

$$0 < m_2 \le \beta_{2k} \le M_2 \quad\text{and}\quad m_3 \le \beta_{3k} \le M_3,$$

then the standardized sum of the x's tends to be normally
distributed as n tends to infinity.
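If we write

$$\sigma^2 = \sum_{k=1}^{n} \sigma_k^2,$$

then $x$, the standardized sum of the $x$'s, may be written

$$x = \frac{1}{\sigma} \sum_{k=1}^{n} (x_k - \mu_k),$$

where $\mu_k$ denotes the mean of the variable $x_k$, and $\sigma_k$ its standard
error (i.e. $\sigma_k^2 = \beta_{2k}$).

Let $\xi_k = x_k - \mu_k$.

Then $\mathcal{E}(\xi_k) = 0$, $\mathcal{E}(\xi_k^2) = \sigma_k^2 = \beta_{2k}$, $\mathcal{E}(|\xi_k|^a) = \beta_{ak}$,

and clearly, provided the moments of $x_k$ exist, the moments of
$\xi_k$ will exist.

The characteristic function of $x$ will be

$$\phi_x(t) = \prod_{k=1}^{n} \phi_{\xi_k}\!\left(\frac{t}{\sigma}\right).$$

Expand $\phi_{\xi_k}(t)$ in powers of $t$. The mean value of $\xi_k$ is zero and
we have therefore no term of the first order in $t$.

Generally, for real $z$,

$$e^{iz} = 1 + iz + \frac{(iz)^2}{2!} + \theta \frac{(iz)^3}{3!}, \quad \text{where } |\theta| \le 1,$$

or, putting $z = t\xi_k$,

$$e^{it\xi_k} = 1 + it\xi_k + \frac{(it\xi_k)^2}{2!} + \theta \frac{(it\xi_k)^3}{3!},$$

and hence

$$\phi_{\xi_k}(t) = 1 + \frac{(it)^2}{2!} \sigma_k^2 + \frac{(it)^3}{3!} \int_{-\infty}^{+\infty} \theta\, \xi_k^3\, p(\xi_k)\, d\xi_k$$

for the continuous variable. For the discontinuous variable the
summation sign will replace the integral sign. It is clear that

$$\left| \int_{-\infty}^{+\infty} \theta\, \xi_k^3\, p(\xi_k)\, d\xi_k \right| \le \int_{-\infty}^{+\infty} |\xi_k|^3\, p(\xi_k)\, d\xi_k.$$

The right-hand side of this inequality is finite; for whatever $k$
there exist two numbers $m_3$ and $M_3$ such that

$$m_3 \le \int_{-\infty}^{+\infty} |\xi_k|^3\, p(\xi_k)\, d\xi_k \le M_3.$$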

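We may now proceed to a consideration of $\phi_x(t)$. Write

$$R_k = \int_{-\infty}^{+\infty} \theta\, \xi_k^3\, p(\xi_k)\, d\xi_k \quad \text{or} \quad R_k = \sum \theta\, \xi_k^3\, P(\xi_k)$$

as required, $\theta$ being taken for the argument $t/\sigma$, so that $|R_k| \le \beta_{3k} \le M_3$. Then

$$\phi_x(t) = \prod_{k=1}^{n} \left[ 1 - \frac{t^2 \sigma_k^2}{2\sigma^2} + \frac{(it)^3 R_k}{6\sigma^3} \right]$$

or, if we take logarithms,

$$\psi_x(t) = \sum_{k=1}^{n} \log \left[ 1 - \frac{t^2 \sigma_k^2}{2\sigma^2} + \frac{(it)^3 R_k}{6\sigma^3} \right].$$

Expand in the usual way

$$\log (1 + z) = z - \frac{z^2}{2(1 + \epsilon z)^2},$$

where $0 < \epsilon < 1$. Writing

$$z_k = -\frac{t^2 \sigma_k^2}{2\sigma^2} + \frac{(it)^3 R_k}{6\sigma^3}, \qquad Q_k = \frac{z_k^2}{(1 + \epsilon z_k)^2},$$

and remembering that $\sum_{k=1}^{n} \sigma_k^2 = \sigma^2$, $\psi_x(t)$ may be written as

$$\psi_x(t) = -\frac{t^2}{2} + \frac{(it)^3}{6} \sum_{k=1}^{n} \frac{R_k}{\sigma^3} - \frac{1}{2} \sum_{k=1}^{n} Q_k.$$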

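We have now to consider the behaviour of each of these terms as
$n$ is increased without limit.

(1) $\sum_{k=1}^{n} R_k / \sigma^3$. It has been shown that

$$|R_k| \le M_3$$

and therefore

$$\left| \sum_{k=1}^{n} R_k \right| \le n M_3.$$

Similarly

$$\sigma^2 = \sum_{k=1}^{n} \beta_{2k} \ge n m_2.$$

Hence

$$\left| \sum_{k=1}^{n} \frac{R_k}{\sigma^3} \right| \le \frac{n M_3}{(n m_2)^{3/2}} = \frac{M_3}{m_2^{3/2}\, n^{1/2}}.$$

$M_3$ and $m_2$ are fixed numbers. It follows that as $n$ tends to
infinity the right-hand side of this inequality tends to zero, and
therefore that

$$\lim_{n \to \infty} \sum_{k=1}^{n} \frac{R_k}{\sigma^3} = 0.$$

(2) $Q_k$. $Q_k$ was defined as

$$Q_k = \frac{z_k^2}{(1 + \epsilon z_k)^2}, \qquad z_k = -\frac{t^2 \sigma_k^2}{2\sigma^2} + \frac{(it)^3 R_k}{6\sigma^3},$$

where $0 < \epsilon < 1$.
We have investigated the behaviour of the term involving $R_k$.
We must now find

$$\lim_{n \to \infty} \frac{\sigma_k^2}{\sigma^2}.$$

From definition

$$0 < \frac{\sigma_k^2}{\sigma^2} \le \frac{M_2}{n m_2}$$

and hence the required limit as $n$ tends to infinity will be zero.
Both parts of $z_k$ are therefore at most of order $1/n$ for fixed $t$, so
that each $Q_k$ is at most of order $1/n^2$ and the sum of the $n$ terms
$Q_k$ is at most of order $1/n$. It follows that

$$\lim_{n \to \infty} \sum_{k=1}^{n} Q_k = 0.$$

It may therefore be shown that as $n$ increases without limit each
of the terms except the first in the expansion of $\psi_x(t)$ tends
separately to zero, or that

$$\psi_x(t) \to -\tfrac{1}{2} t^2 \quad \text{as } n \to \infty$$

and

$$\phi_x(t) \to e^{-\frac{1}{2} t^2} \quad \text{as } n \to \infty.$$

Hence under the given restrictions of the theorem

$$P\{x < X\} \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{X} e^{-\frac{1}{2} x^2}\, dx \quad \text{as } n \to \infty.$$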



This is an extremely powerful result. We have shown that, subject 
to the random variables being independent and possessing the 
first three absolute moments, their standardized sum will tend 
to be normally distributed as the number of variables is increased 
no matter what the distributions of the variables. Moreover, 
although we do not discuss it here, all the conditions of the 
theorem need not necessarily be satisfied; under certain condi- 
tions the variables need not be independent and it is sufficient 
to assume the absolute moments of order $2 + \delta$ exist, where $\delta$ is
some number greater than zero.
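The approach of the characteristic function of a standardized sum to $e^{-\frac{1}{2}t^2}$ may be watched numerically. The Python sketch below is an illustration only. It assumes, purely as a convenient example, independent variables each taking the values 0 and 1, with probabilities of the value 1 cycling through 0.1, 0.3 and 0.6, so that the variables are not identically distributed. The function names and the numerical values are hypothetical.

```python
import cmath
import math


def standardized_sum_cf(t, probs):
    """Characteristic function of the standardized sum of independent
    variables x_k taking the value 1 with probability probs[k], else 0."""
    sigma = math.sqrt(sum(p * (1 - p) for p in probs))
    u = t / sigma
    value = 1 + 0j
    for p in probs:
        q = 1 - p
        # characteristic function of x_k - p, evaluated at t / sigma
        value *= q * cmath.exp(-1j * p * u) + p * cmath.exp(1j * q * u)
    return value


def gap_from_normal(t, n):
    """Modulus of the difference between the exact characteristic function
    of the standardized sum of n variables and exp(-t^2 / 2)."""
    probs = [(0.1, 0.3, 0.6)[k % 3] for k in range(n)]
    return abs(standardized_sum_cf(t, probs) - math.exp(-t * t / 2))


for n in (10, 100, 1000, 10000):
    print(n, gap_from_normal(1.0, n))
```

The printed differences fall roughly as $1/\sqrt{n}$, in keeping with the bound obtained for the term involving $R_k$.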

The generalized theorem of Liapounoff may be proved in 
several different ways. Perhaps one of the simplest methods of 
proof is by using Liapounoff's inequality for moments. We shall
state both the moment inequality and the generalized theorem
but will refer the reader for proof to any of the treatises on the
calculus of probability.

LIAPOUNOFF'S INEQUALITY FOR MOMENTS

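The absolute moment of order $k$ of a random variable, $x$, is
defined as

$$\beta_k = \int_{-\infty}^{+\infty} |x|^k\, p(x)\, dx$$

or

$$\beta_k = \sum |x|^k\, P(x)$$

according as the variable is continuous or discontinuous.
If $a$, $b$ and $c$ are real numbers such that

$$a \ge b \ge c \ge 0,$$

then $$\beta_b^{\,a-c} \le \beta_c^{\,a-b}\, \beta_a^{\,b-c}.$$

The proof follows directly from repeated applications of Cauchy's
inequality.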
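The inequality is easily tried on a numerical example. The Python sketch below is an illustration only. The four values and their probabilities are an arbitrary hypothetical distribution, and the function names are likewise hypothetical. A small relative tolerance is allowed because the case $a = b$ gives equality, which floating-point arithmetic reproduces only approximately.

```python
def absolute_moment(values, probs, order):
    """Absolute moment of the given order of a discontinuous variable."""
    return sum(p * abs(x) ** order for x, p in zip(values, probs))


# hypothetical discontinuous distribution used only for illustration
values = [-2.0, -0.5, 1.0, 3.0]
probs = [0.1, 0.4, 0.3, 0.2]


def inequality_holds(a, b, c):
    """Check beta_b^(a-c) <= beta_c^(a-b) * beta_a^(b-c) for a >= b >= c >= 0."""
    lhs = absolute_moment(values, probs, b) ** (a - c)
    rhs = (
        absolute_moment(values, probs, c) ** (a - b)
        * absolute_moment(values, probs, a) ** (b - c)
    )
    return lhs <= rhs * (1 + 1e-12)


for triple in [(3, 2, 1), (4, 2, 0), (2.5, 2, 0.5), (3, 3, 1)]:
    print(triple, inequality_holds(*triple))
```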

LIAPOUNOFF'S THEOREM 

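If (1) $x_1, x_2, \ldots, x_n$ are random independent variables with zero
means,

(2) $x_1, x_2, \ldots, x_n$ each possess absolute moments of order
$2 + \delta$, where $\delta > 0$,

(3) $\sigma$ is the standard deviation of $\sum_{k=1}^{n} x_k$,

(4) the ratio

$$\frac{\beta_{2+\delta,1} + \beta_{2+\delta,2} + \cdots + \beta_{2+\delta,n}}{\sigma^{2+\delta}},$$

where $\beta_{2+\delta,k}$ is the absolute moment of order $2 + \delta$ of $x_k$, tends to zero as $n$
tends to infinity,

then the standardized sum of the $x$'s tends to be normally
distributed as $n$ tends to infinity.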
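Condition (4) can be examined directly for a simple case. The Python sketch below is an illustration only. It assumes $\delta = 1$ and variables $x_k$ which take the value $q$ with probability $p$ and the value $-p$ with probability $q = 1 - p$, so that each has zero mean. The choice $p = 0.3$ and the function name are hypothetical.

```python
def liapounoff_ratio(probs, delta=1.0):
    """Sum of the absolute moments of order 2 + delta divided by
    sigma^(2 + delta), for independent two-valued variables of zero mean."""
    order = 2 + delta
    beta_sum = 0.0
    variance_sum = 0.0
    for p in probs:
        q = 1 - p
        # the variable takes the value q with probability p, -p with probability q
        beta_sum += p * q**order + q * p**order
        variance_sum += p * q
    return beta_sum / variance_sum ** (order / 2)


ratios = [liapounoff_ratio([0.3] * n) for n in (100, 1000, 10000)]
print(ratios)
```

For this case the ratio is a fixed multiple of $1/\sqrt{n}$, so the condition of the theorem is satisfied.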

A large number of distributions may be found in statistics 
which will satisfy the conditions of the theorem. If it is known 
that observations have been independently and randomly drawn 
from a population, the moments of which satisfy the theorem, 
then as the size of the sample is increased the standardized mean 
of the sample will tend to be normally distributed. The theorem 
has therefore a wide field of application in statistical theory, 
possibly wider than any other single theorem.

CHAPTER XVII

CHARACTERISTIC FUNCTIONS. 
CONVERSE THEOREMS 

In the previous two chapters we have been concerned with the 
derivation of the characteristic function of a variable from its 
probability law and the proof of various theorems by means of 
the characteristic function. In the proof of the theorems it has 
been usual to derive a limiting characteristic function which is 
then recognized as being the characteristic function of a given 
probability law. In most of the cases where the elementary 
theory of this treatise is applicable this procedure is adequate, 
but it cannot have escaped attention that there may be occa- 
sions when the characteristic function cannot be recognized as 
belonging to any known probability law. The probability law 
of any random variable, x, may be calculated, if its characteristic 
function is known, by means of known theorems. We shall state 
and prove the theorem when the variable is discontinuous, and 
state the theorem without proof when the variable is continuous. 
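THEOREM. $x$ is a discontinuous random variable which may
take only zero or positive integral values. If $p_x(k) = P\{x = k\}$ is
the elementary probability law of $x$, and $\phi_x(t)$ its characteristic
function, then

$$p_x(k) = \frac{1}{2\pi} \int_{-\pi}^{+\pi} \phi_x(t)\, e^{-itk}\, dt.$$

By definition

$$\phi_x(t) = \mathcal{E}(e^{itx}) = \sum_{x=0}^{\infty} e^{itx}\, p(x).$$

We shall consider

$$\frac{1}{2\pi} \int_{-\pi}^{+\pi} \phi_x(t)\, e^{-itk}\, dt$$

and prove that it is equal to $p_x(k)$.

$$\frac{1}{2\pi} \int_{-\pi}^{+\pi} \phi_x(t)\, e^{-itk}\, dt = \frac{1}{2\pi} \int_{-\pi}^{+\pi} \sum_{x=0}^{\infty} p(x)\, e^{it(x-k)}\, dt.$$

Since the series is uniformly convergent with respect to $t$ the
summation and integral signs may be transposed and we have

$$\frac{1}{2\pi} \int_{-\pi}^{+\pi} \sum_{x=0}^{\infty} p(x)\, e^{it(x-k)}\, dt = \frac{1}{2\pi} \sum_{x=0}^{\infty} p(x) \int_{-\pi}^{+\pi} e^{it(x-k)}\, dt$$

$$= \frac{1}{2\pi} \left[ \sum_{x=0}^{k-1} p(x) \int_{-\pi}^{+\pi} e^{it(x-k)}\, dt + p(k) \int_{-\pi}^{+\pi} dt + \sum_{x=k+1}^{\infty} p(x) \int_{-\pi}^{+\pi} e^{it(x-k)}\, dt \right].$$

The first and third integrals vanish and we have

$$\frac{1}{2\pi} \int_{-\pi}^{+\pi} \phi_x(t)\, e^{-itk}\, dt = p_x(k).$$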
THEOREM. $x$ is an absolutely continuous random variable the
probability law of which is $p(x)$. If the characteristic function
corresponding to $p(x)$ is $\phi_x(t)$, then

$$p(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \phi_x(t)\, e^{-itx}\, dt.$$
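Example. Given

$$\phi_x(t) = \exp[-\lambda(1 - e^{it})]$$

and that $x$ may only take zero and positive integer values
0, 1, 2, ..., find the probability law of $x$.

$$p_x(k) = \frac{1}{2\pi} \int_{-\pi}^{+\pi} e^{-itk - \lambda(1 - e^{it})}\, dt = \frac{e^{-\lambda}}{2\pi} \int_{-\pi}^{+\pi} e^{-itk} \left( \sum_{r=0}^{\infty} \frac{\lambda^r e^{irt}}{r!} \right) dt.$$

The series is uniformly convergent with respect to $t$ and we may
therefore write

$$p_x(k) = \frac{e^{-\lambda}}{2\pi} \sum_{r=0}^{\infty} \frac{\lambda^r}{r!} \int_{-\pi}^{+\pi} e^{it(r-k)}\, dt$$

$$= \frac{e^{-\lambda}}{2\pi} \left[ \sum_{r=0}^{k-1} \frac{\lambda^r}{r!} \int_{-\pi}^{+\pi} e^{it(r-k)}\, dt + \frac{\lambda^k}{k!} \int_{-\pi}^{+\pi} dt + \sum_{r=k+1}^{\infty} \frac{\lambda^r}{r!} \int_{-\pi}^{+\pi} e^{it(r-k)}\, dt \right],$$

giving $p_x(k) = \dfrac{e^{-\lambda} \lambda^k}{k!}$,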

which is recognized to be the elementary probability law of 
Poisson's binomial limit. 
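The inversion integral for a discontinuous variable lends itself to numerical evaluation. The Python sketch below is an illustration only. It replaces the integral over $(-\pi, +\pi)$ by a sum over equally spaced values of $t$ and applies it to the characteristic function $\exp[-\lambda(1 - e^{it})]$. The value $\lambda = 2.5$, the number of points and the function names are hypothetical choices.

```python
import cmath
import math


def invert_lattice_cf(cf, k, num_points=256):
    """Approximate (1 / 2 pi) * integral over (-pi, pi) of cf(t) e^{-itk} dt
    by a sum over num_points equally spaced values of t."""
    step = 2 * math.pi / num_points
    total = 0j
    for j in range(num_points):
        t = -math.pi + j * step
        total += cf(t) * cmath.exp(-1j * t * k)
    return (total * step / (2 * math.pi)).real


lam = 2.5  # hypothetical value of the parameter


def poisson_cf(t):
    """Characteristic function exp[-lam (1 - e^{it})]."""
    return cmath.exp(-lam * (1 - cmath.exp(1j * t)))


for k in range(6):
    exact = math.exp(-lam) * lam**k / math.factorial(k)
    print(k, invert_lattice_cf(poisson_cf, k), exact)
```

Since the integrand is periodic in $t$, the equally spaced sum recovers $p_x(k)$ apart from the probabilities of values exceeding $k$ by a multiple of the number of points, which are here negligible.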
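Example. Given

$$\phi_x(t) = e^{-\frac{1}{2} t^2}$$

and that $x$ is a continuous random variable which may take any
values between $-\infty$ and $+\infty$, find the elementary probability
law of $x$.

$$p(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \phi_x(t)\, e^{-itx}\, dt = \frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-\frac{1}{2} t^2 - itx}\, dt.$$

Complete the square in the exponent.

$$p(x) = \frac{1}{2\pi}\, e^{-\frac{1}{2} x^2} \int_{-\infty}^{+\infty} e^{-\frac{1}{2}(t + ix)^2}\, dt = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} x^2}.$$

Thus the elementary probability law of $x$ is a normal distribution
as will already have been recognized from the form of the
characteristic function.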

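Example. Given that $x$ is a continuous variable which may
take any values between $-\infty$ and $+\infty$ and that

$$\phi_x(t) = e^{-a|t|},$$

where $a > 0$, find the elementary probability law of $x$.

$$p(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \phi_x(t)\, e^{-itx}\, dt = \frac{1}{2\pi} \int_{-\infty}^{0} e^{t(a - ix)}\, dt + \frac{1}{2\pi} \int_{0}^{+\infty} e^{-t(a + ix)}\, dt$$

$$= \frac{1}{2\pi} \left[ \frac{1}{a - ix} + \frac{1}{a + ix} \right] = \frac{a}{\pi(a^2 + x^2)}.$$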
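The same inversion may be carried out numerically. The Python sketch below is an illustration only. Because $e^{-a|t|}$ is real and even in $t$, the inversion integral reduces to $\frac{1}{\pi} \int_0^{\infty} \phi_x(t) \cos tx \, dt$, which is here cut off at a finite upper limit and evaluated by the midpoint rule. The value $a = 2$, the upper limit, the number of steps and the function names are hypothetical choices.

```python
import math


def invert_even_cf(cf, x, t_max=60.0, steps=60000):
    """Approximate (1 / 2 pi) * integral of cf(t) e^{-itx} dt for a real,
    even cf, as (1 / pi) * integral from 0 to t_max of cf(t) cos(tx) dt."""
    h = t_max / steps
    total = 0.0
    for j in range(steps):
        t = (j + 0.5) * h  # midpoint of the j-th sub-interval
        total += cf(t) * math.cos(t * x)
    return total * h / math.pi


a = 2.0  # hypothetical value of the constant a > 0


def cauchy_cf(t):
    """Characteristic function exp(-a |t|)."""
    return math.exp(-a * abs(t))


for x in (0.0, 0.5, 3.0):
    print(x, invert_even_cf(cauchy_cf, x), a / (math.pi * (a * a + x * x)))
```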

The result of this last example should be remembered. It is
comparatively easy to obtain the probability law from a knowledge
of the characteristic function but it is not so simple to
proceed in the reverse direction without a knowledge of contour
integration. For the reader who is not familiar with this kind
of integration it is legitimate to memorize that a certain
characteristic function comes from a certain probability law, or
vice versa, and to prove the connexion by whichever process is
simpler.

Example. If $\bar{x} = \dfrac{1}{n} \sum_{j=1}^{n} x_j$, where it is known that the $x$'s are
independent continuous random variables, and that for any $x_j$

$$p(x_j) = \frac{1}{\pi} \cdot \frac{1}{1 + x_j^2}, \qquad -\infty < x_j < +\infty,$$

use the characteristic function to deduce the probability law
of $\bar{x}$.

From the evidence of the preceding example we may guess the
characteristic function of $x_j$ to be

$$\phi_{x_j}(t) = e^{-|t|}$$

and prove that it is so by direct integration.

$$p(x_j) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \phi_{x_j}(t)\, e^{-itx_j}\, dt = \frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-|t| - itx_j}\, dt.$$

Dividing the integral into two parts as before and integrating
separately we have

$$p(x_j) = \frac{1}{2\pi} \left[ \frac{1}{1 - ix_j} + \frac{1}{1 + ix_j} \right] = \frac{1}{\pi} \cdot \frac{1}{1 + x_j^2}.$$

Since to every probability law there is a uniquely defined
characteristic function, $\phi_{x_j}(t)$ as defined above will be the
characteristic function of the variable $x_j$, following the given
probability law $p(x_j)$.

It has been demonstrated earlier that

$$\phi_{\bar{x}}(t) = \prod_{j=1}^{n} \phi_{x_j}\!\left(\frac{t}{n}\right).$$

Hence the characteristic function of $\bar{x}$ will be

$$\phi_{\bar{x}}(t) = \left( \exp\left[ -\frac{|t|}{n} \right] \right)^n = e^{-|t|}.$$

It follows that

$$p(\bar{x}) = \frac{1}{\pi} \cdot \frac{1}{1 + \bar{x}^2}.$$

Exercise. Given that $x_j$ is a discontinuous random variable
which may take only positive values and that



where $a$ and $h$ are constants, find $p(x_j)$ and $p(\bar{x})$, where

$$\bar{x} = \frac{1}{N} \sum_{j=1}^{N} x_j.$$

Example. $x$ is a discontinuous random variable which may take
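only zero or positive integer values. If

$$\phi_x(t) = \exp\left[ -a\left( 1 - \exp[-b(1 - e^{it})] \right) \right],$$

where $a > 0$ and $b > 0$, calculate the probability law of $x$.

$$P\{x = k\} = p_x(k) = \frac{1}{2\pi} \int_{-\pi}^{+\pi} \exp\left[ -itk - a\left( 1 - \exp[-b(1 - e^{it})] \right) \right] dt.$$

If, as before, we replace the exponential by a series uniformly
convergent in $t$, then

$$p_x(k) = \frac{e^{-a}}{2\pi} \sum_{r=0}^{\infty} \frac{a^r e^{-rb}}{r!} \sum_{l=0}^{\infty} \frac{(rb)^l}{l!} \int_{-\pi}^{+\pi} e^{it(l-k)}\, dt.$$

When $l = k$ the integral is equal to $2\pi$ and when $l \ne k$ the integral
is zero. Hence

$$p_x(k) = \frac{e^{-a}\, b^k}{k!} \sum_{r=0}^{\infty} \frac{(a e^{-b})^r\, r^k}{r!}.$$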

This is the probability law of Neyman's contagious distribution 
discussed earlier. 
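The series for $p_x(k)$ can be compared with a direct numerical inversion of the characteristic function. The Python sketch below is an illustration only. It sums the series $\frac{e^{-a} b^k}{k!} \sum_r (a e^{-b})^r r^k / r!$ to a fixed number of terms, with the convention $0^0 = 1$ for the term $r = 0$, $k = 0$, and evaluates the inversion integral over $(-\pi, +\pi)$ by an equally spaced sum. The values $a = 2$, $b = 1.5$, the truncation points and the function names are hypothetical choices.

```python
import cmath
import math


def neyman_pmf(k, a, b, terms=80):
    """Series e^{-a} b^k / k! * sum_r (a e^{-b})^r r^k / r!, cut off after `terms` terms."""
    total = 0.0
    for r in range(terms):
        total += (a * math.exp(-b)) ** r * r**k / math.factorial(r)
    return math.exp(-a) * b**k / math.factorial(k) * total


def neyman_cf(t, a, b):
    """Characteristic function exp[-a (1 - exp[-b (1 - e^{it})])]."""
    return cmath.exp(-a * (1 - cmath.exp(-b * (1 - cmath.exp(1j * t)))))


def invert_lattice_cf(cf, k, num_points=256):
    """Approximate (1 / 2 pi) * integral over (-pi, pi) of cf(t) e^{-itk} dt."""
    step = 2 * math.pi / num_points
    total = 0j
    for j in range(num_points):
        t = -math.pi + j * step
        total += cf(t) * cmath.exp(-1j * t * k)
    return (total * step / (2 * math.pi)).real


a, b = 2.0, 1.5  # hypothetical values of the constants
for k in range(6):
    by_series = neyman_pmf(k, a, b)
    by_inversion = invert_lattice_cf(lambda t: neyman_cf(t, a, b), k)
    print(k, by_series, by_inversion)
```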